Preface



IFIP Working Group 2.5 on Numerical Software has always been con- 
cerned not just with strictly numerical issues, such as details of floating 
point arithmetic or numerical algorithms, but also with the software is- 
sues that affect the viability of scientific computation. Previous working 
conferences sponsored by WG2.5 have included focus on software porta- 
bility, programming languages for numerical software, programming and 
problem solving environments, parallelism, performance evaluation and 
software quality. These issues are, of course, of wider significance be- 
yond numerical software. The solid basis of a practical perspective on 
scientific computation has led the work of WG2.5 to be pioneering even
in this wider context. 

This working conference addresses another such topic, one only re- 
cently recognized as a critical aspect of software design, but a topic 
increasingly important as software gets larger and more complex, con- 
structed by teams of people and evolved over decades. This topic is 
software architecture. Software architecture refers to the way software 
is structured to promote objectives such as reusability, maintainability, 
extensibility, feasibility of independent implementation, etc. In the con- 
text of scientific computation, the challenge facing mathematical soft- 
ware practitioners is to design, develop and supply computational com- 
ponents which deliver these objectives when embedded in end-user ap- 
plication codes. 

At one time, scientific computing applications were sufficiently sim- 
ple that it was feasible with each new computational problem for one to 
start from scratch and construct a monolithic program specific to solving 
that particular problem. As these applications became more complex, 
it became necessary to amortize the development effort over many sim- 
ilar computational problems. More significantly, it became necessary to 
be able to take advantage of rare, highly specialized, skills through em- 
ploying off-the-shelf software components implemented by third parties. 
Thus for many years, the software architecture of a scientific computing 
application has typically been that the computation is effected not by 
just a single program but by the operation of a suite of related programs
acting on a common database. Within such a suite, the individual pro- 
grams are structured from subprograms. Some of these subprograms are 
obtained from libraries provided by various commercial suppliers, and 
some from public domain sources. A few subprograms will be unique 
to this suite of programs, representing the particular modeled science 
in the suite and the desired sequences of operations to be performed. 
Commonly, the user provides the main program that calls the subpro- 
grams. In some cases, however, the main program is a framework that 
can realize different computations by making appropriate calls to the 
available subprograms, including any user-provided plug-in, and control 
over the sequence of calls and their parameters is provided by the user 
through a problem oriented scripting language. 

Today, new options for architectures for scientific computation are 
becoming available, and some of the older paradigms may need to be 
re-thought. What are the implications of widespread connectivity to 
networks, in terms of the construction of scientific computing applica- 
tions in the future? Do distributed computing models such as CORBA 
and ActiveX, or the RMI (Remote Method Invocation) of Java, form 
an appropriate basis on which to build higher level protocols (e.g., for 
interacting mathematical and geometric objects) in pursuit of the goal 
of “plug-and-play” application inter-operability? Do we need to extend
the notion of a common database to embrace federated databases, which 
may be geographically or organizationally dispersed across the Web, and 
only intermittently available? How can we exploit concurrency and par- 
allel execution for monitoring, visualization and steering purposes, for 
instance, as well as for straightforward performance? If, as many people 
argue, object-oriented computing provides a more appropriate program- 
ming basis than procedural languages, how can the properties (such as 
reliability, portability, efficiency, etc.) so painstakingly pursued by the 
developers of “traditional” subroutine libraries be preserved?

With many conferences being held on the topic of software architecture,
what is the added value of holding a conference on software
architecture for scientific computing applications? Why might we
anticipate that software architecture for scientific applications would
differ from software architecture suitable for other situations? What
are the distinctive characteristics of scientific applications?

Scientific applications often are very large computations that strain 
the resources of whatever computers are available. Clever algorithms 
and data structures may need to be utilized so that the computation 
can be done with acceptable efficiency, or even so that it can be done 
at all. The intended computation is often ill defined, or at least
incompletely defined, and completion of the specification must be suggested
by experience, by analogy to other computations, or by heuristics. Data 
types may include not just structures of text and numbers, but the full 
range of multimedia, from sound to simultaneous time series, from still 
images to video. Scientific applications typically require deep scientific 
and domain knowledge, and depend on subtle interplay of different ap- 
proximations. They often implement sophisticated mathematics. In 
many cases, numerical computations need to be carefully arranged not 
just to avoid inaccurate results, but to avoid instability of processes. 
Sensitivity to external input data can be a problem, and inadequate in- 
put data is common. Insightful display of computed quantities may be 
essential for analyzing and interpreting results or for interactive control 
to steer the computation. Scientific computations are often experiments, 
and must be controlled, recorded, and categorized so that they can be 
compared to other experimental observations. 

These characteristics make the implementation of scientific applica- 
tions challenging, but have little direct effect on the architecture chosen 
for the software. Two classes of characteristics do, however, have sig- 
nificant impact on software architectures that are, or should be, used 
for scientific applications. The first of these classes is the characteris- 
tics of the context in which scientific applications are implemented, and 
the second class is the characteristics of the context in which scientific 
applications evolve. Can you suggest others? 

Implementation Context 

1 Developers of scientific applications usually have deep understand- 
ing of the science, and deep domain knowledge. They may have 
deep knowledge of the relevant mathematics. On the other hand, 
they typically don’t see themselves as professional programmers; 
they see themselves as scientists. A consequence of this is that they 
are not familiar with modern software engineering practice and 
knowledge and may not recognize the need for it. As an extreme 
illustration, all too often they produce little or no documentation. 

2 Normally, no formal specification document exists before or even 
after the application is implemented. 

3 The development team is often distributed geographically and
organizationally; that is, there is no single organizational management
in control of the development, and developers are often volunteers
from diverse organizations.

4 Using and supporting shared libraries is a well-established practice
in this community, as is working with libraries and tools obtained 
from third parties. Today input/output to working and persistent 
backing store and to networked computers must be done in con- 
sistent ways in related application programs, which is most easily 
guaranteed by a common I/O library.

5 Conservatism of customers and compatibility with existent li- 
braries restrict implementations to older languages, older program- 
ming environments, and older tools, so advances reliant on changes 
in these are unlikely to be accepted. 

6 Performance of scientific applications is often critical, hence elegant 
architectures which imply performance penalties, for example on
parallel computers, will not be accepted.

7 Mixed-language programming is not uncommon.

8 Open source is a common form of distribution, with the corollary
that it is often unclear which specific version is in use.

Evolution Context 

1 Revisions of scientific applications to produce new versions often 
entail changes that are not local. For example, changes to physics 
models tend to change every equation as well as initial conditions 
and boundary values. Thus an architecture that decomposed the 
physical region being modeled into subregions, each handled by a 
separate software module, would require changes to all the mod- 
ules. However, the changes are often systematic and could be 
generated automatically. 

2 Code ownership migrates over time as interested developers move 
to different institutions, or as researchers at new institutions choose 
to contribute. Merging multiple development streams is common. 

3 Formal responsibility for maintenance rarely exists, and corre- 
spondingly scaffolding tools necessary for maintenance and evo- 
lution are generally unsupported. 

4 Regression testing is rare; indeed, facilities for regression testing 
of components or for regression testing of system integration are 
almost unknown. Consequently retesting after enhancement is ex- 
pensive and often omitted. 

5 Although many developers of scientific applications may believe
otherwise, the software lifetimes of such applications are often
measured in decades, even though not a single line of code may persist
unchanged.

The purpose of this meeting, then, was to address questions of soft- 
ware architecture in the context of scientific computing applications by 
bringing together practitioners of scientific computation who have made
innovations in software architecture, those with experience in trying the
new paradigms, and component vendors who must support them. We need
to share the experience of what is and what is likely to remain effective 
and how it needs to be expressed. 



Morven Gentleman 

Foundation for Research and Technology

Abstract Rapid advances in modern networking technologies and commodity
high performance computing systems are leading the field of computing
into a new paradigm referred to as network-based computing (NC).
This paradigm views a large number of geographically distributed com- 
puter resources such as PCs, workstations, Symmetric Multi-processors 
(SMP) and Massively Parallel Processing (MPP) systems connected 
through a high speed network as a single meta-computer or compu- 
tational grid [2]. In this paper, we focus on the Internet (WAN) and 
Intranet (LAN) based computational grids and their ability to support 
scalable and robust “deep” computing. We present various implemen- 
tations of the NC paradigm using commodity and customized software 
in the context of the existing PELLPACK problem solving environment 
[8], the ITPACK library as it has been implemented in the PELLPACK 
system, and a multi-physics application for the design of gas turbine
engines [11]. Through this study we attempt to assess the feasibility and
efficiency of several NC paradigms for scientific applications utilizing 
existing middleware. 

Keywords: network computing, grid computing, deep computing, problem solving 
environment, legacy code, middleware, agent based computing, client- 
server computing, multi-physics application. 

1. INTRODUCTION

The Internet has long been used for communication. Until recently, 
there has been little use of the network for actual computations. This 
situation is changing rapidly and will have enormous impact on the fu- 
ture. It is clear that the old slogan “The Network is the Computer” is 
becoming a reality. It will not only change the way we live, work, learn, 
and communicate with each other, but it will change the way we do com- 
putational science and electronic prototyping. In this paper we assess 
the capability and effectiveness of several commodity and customized 
software technologies to support scalable and robust “deep” scientific 
computing on Internet and Intranet based computational grids [2, 6]. 
For this we apply the NC paradigm to a number of applications with 
which we have significant experience. 

In the 1970s, the scientific computing community established the con- 
cept of a software library and introduced procedures for testing and 
disseminating such artifacts, which have evolved into international stan- 
dards. Current information technologies (IT) allow the easy develop- 
ment and integration of GUI interfaces, domain specific textual and vi- 
sual languages, visualization libraries, portable computational libraries, 
knowledge bases and other related technologies. These technologies al- 
ready allow the user to exploit the power of the hardware resources while 
reducing the overhead of specifying and visualizing the results of a simu- 
lation. These developments have led to the concept of a Problem Solving 
Environment (PSE) that promises to provide industrial scientists and en- 
gineers with environments and seamless integration mechanisms allow- 
ing them to spend more time doing science and engineering rather than 
“computing”. Now, the NC paradigm promises to put the PSE technol- 
ogy at the fingertips of any scientist and engineer anytime and anywhere. 
The first NC paradigm that we study is the so-called client/server com- 
puting. In this scenario, software usage is not viewed as a commodity 
but as a service. The computational service provider will offer users all
the resources needed to solve the problem in some “natural” form. In 
this context, we study the concept of client/server computing for PSEs 
and scientific libraries. This concept is demonstrated utilizing elements 
of the PELLPACK library and a “thin” [12] client-server WebPDELab. 

The process of prototyping is part of every scientific inquiry, prod- 
uct design and learning activity. The new economic realities require the 
rapid prototyping of manufactured artifacts and rapid solutions to prob- 
lems with numerous interrelated elements. This, in turn, requires the 
fast, accurate simulation of physical processes and design optimization 
using knowledge and computational models from multiple disciplines 
(multi-physics and multi-scale models) in science and engineering. The
realization of rapid multidisciplinary prototyping is the new grand chal- 
lenge. In these applications the software is often distributed geograph- 
ically and the various groups involved have limited knowledge of all 
software components used to simulate the atomic elements involved in 
the prototyping of a composite artifact. In this application scenario, the 
natural computational resource is a “computational grid” or “The Net” 
that connects the needed distributed hardware and software resources 
used to simulate the elements of the artifact. We present the computa- 
tional skeleton of a multi-physics application associated with the simu- 
lation of gas turbine engines and its implementation on a computational 
grid of heterogeneous computational resources, utilizing an agent based 
system known as Grasshopper [9] and Java based commodity software. 
Moreover, we discuss how these two implementations handle issues like 
heterogeneity, scalability, adaptability, and fault tolerance in the context 
of the targeted application scenarios. 

The paper is organized as follows. Section 2 presents a detailed
description of the thin client-server WebPDELab based on the PELLPACK
PSE and its libraries. Section 3 presents two implementations of a
multi-physics application related to gas turbine engine simulation uti- 
lizing middleware technologies, the Grasshopper agent platform and a 
Java RMI middleware. In Section 4 we demonstrate two approaches for 
managing the software and computational resources based on locations
of data, application code, and hardware. In these approaches, the code 
of interest is accessed as a service provided by a computational server. 
The user’s data is sent to the server, the requested operations are per- 
formed, and the result is sent back to the initiator of the computation. 
We use the Java RMI middleware to implement remote computing with 
the ITPACK library and expand the PELLPACK PSE utilizing the Net- 
Solve [4] middleware and its servers. Finally, Section 5 summarizes our 
observations. 

2. WEBPDELAB: A WEB-BASED THIN
CLIENT-SERVER FOR THE PELLPACK PSE

WebPDELab is a World Wide Web server that allows users to define, 
solve and analyze partial differential equation (PDE) problems using a 
comprehensive graphical user interface from any Java-enabled browser. 
The WebPDELab server is currently supported by a 32 CPU Intel clus- 
ter which allows users to solve PDE problems sequentially or in parallel. 
WebPDELab is the PELLPACK [8, 16] problem-solving environment 
implemented as an Internet-based client-server application. It provides
access to a library of PDE solvers and an interactive graphical interface
that support the pre-processing and post-processing phases of PDE com- 
puting. The PELLPACK software is implemented as a system of X win- 
dows programs and libraries, compiled on an i86pc SunOS 5.6 machine. 
WebPDELab displays the interface of the PELLPACK software within a 
Java-capable browser using the Virtual Network Computing [13] remote 
display system. The URL for the server is http://www.webpdelab.org. 
In this section, we give a brief overview of the PELLPACK system and its
implementation as a server utilizing the thin client/server technology. 

2.1. THE PELLPACK PROBLEM SOLVING 
ENVIRONMENT 

PELLPACK is a system that allows users to specify and solve PDE 
problems on a target computational platform and to visualize the so- 
lution. PELLPACK provides a graphical user interface for defining the 
PDE model and selecting solution methods (see Figure 3), and is
supported by the Maxima symbolic system and well-known numerical
libraries. The graphical interface is implemented on top of a very high
level PDE language. Users can specify their PDE problem and its so- 
lution visually using the graphical interface or textually using “natu- 
ral” language. PELLPACK has incorporated over 100 solvers of various 
types that cover most of the common PDE applications in 2 and 3 di- 
mensions. In the PELLPACK system, a problem is represented by the 
PDE objects involved: PDE models or equations, domain, conditions
on the domain boundary, solution methods, and output requirements. 
The PELLPACK interface consists of many graphical tools and support- 
ing software to assist users in building a problem definition. A textual 
specification of these objects comprises PELLPACK’s natural PDE lan- 
guage, and the language representation of each object is generated by 
the object editors/tools. The language definition of a user’s problem 
(the .e file) is automatically passed to PELLPACK’s language proces- 
sor, which translates the problem into a Fortran driver program, and 
then compiles and links it with numerical libraries containing the user- 
specified solver methods. Sequential or parallel program execution is a 
one-step process; the program is executed on one or more machines in 
the supporting i86pc host cluster. Problem solutions are passed to the 
PELLPACK visualization system for solution display and analysis. 

2.2. THE WEBPDELAB SERVER

In response to the rising cost of managing distributed desktop based 
software, many organizations are revisiting a more centralized and man- 
aged computing strategy using thin client software and computing. Al- 
though this computing paradigm resembles a return to the days of main- 
frame computing, the new challenge is to handle the communication re- 
quirements for application software using sophisticated graphical user 
interfaces. Delivering raw screen pixels, however, requires bandwidth 
that most of today’s network environments cannot afford. Thin-client
systems [12] use a remote display protocol to relay the display
information from the server to the client efficiently. In [2] there is an 
evaluation of a variety of thin clients under various network conditions. 
We have selected the Unix-based AT&T VNC [13] thin client-server
system to implement the PELLPACK PSE as the WebPDELab web server
because, unlike the other available platforms, it is open-source and free.

Users can connect to the WebPDELab site using any Java-enabled 
browser for information, demonstrations, case studies and PDE problem
solving service. A new PELLPACK session is initiated for each user
that connects to the WebPDELab server, and a unique identification and 
private file space for the session are created. The file space is available 
until the user disconnects from the service, at which time the session is 
terminated and the user’s files are deleted. Users may download files 
generated by PELLPACK to their own machines before terminating the 
session, and they may upload files to WebPDELab at the start of subse- 
quent server sessions. When the server invokes the PELLPACK system 
software, the entire PDE problem-solving environment described in [5] 
is presented to the user. This environment was described briefly in
Section 2.1. A detailed description of the functionality and operation of the
PELLPACK software can be found in its User Guide that can be down- 
loaded from the web site. 

2.3. THE WEBPDELAB INTERFACE 

The WebPDELab server is accessed from http://www.webpdelab.org.
This web site is an instructional source for anyone interested in solving
PDE applications. A collection of fully documented case studies is
available at the site presenting step-by-step solutions of common PDE
applications (flow, heat transfer, electromagnetism, conduction), with every
user action and PELLPACK result described with images and detailed 
text. Users who request the PDELab problem solving service must first 
register with WebPDELab. After the user registration information is val- 
idated and the server connects to a host machine, WebPDELab presents 

[Figure 1: screenshot of the WebPDELab control page. Top frame: UPLOAD,
SERVER, DOWNLOAD and LOGOUT options. Bottom frame: server availability,
the assigned user id, and instructions for each option.]

Figure 1 The WebPDELab server with control panel in the top frame and
instructions and connection information in the bottom frame.



a framed HTML page, with a control panel in the top frame (see Figure 
1) consisting of four options: Upload, Download, Server and Logout. The
bottom frame contains the user identification number, host connection 
information, and instructions for using the options of the control panel 
in the top frame. At this point, the WebPDELab server has already cre- 
ated the user’s directory space, so users can upload files to their directory 
using Upload. Generally, users upload PELLPACK problem definition 
files from previous WebPDELab sessions, such as .e files, mesh files and 
solution files. Users can upload up to 20 files to their assigned directory 
space, and files may no longer be uploaded once a user clicks on Server. 
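
The upload rules just described (at most 20 files, and no uploads once Server has been clicked) can be sketched as a small piece of per-session state. This is only an illustrative sketch: the class and method names are hypothetical and are not taken from the WebPDELab CGI scripts.

```python
class UploadPolicy:
    """Hypothetical per-session upload rules: at most 20 files, none after Server."""

    MAX_FILES = 20  # limit stated in the text

    def __init__(self):
        self.uploaded = []           # names of files accepted so far
        self.server_started = False  # set once the user clicks Server

    def start_server(self):
        """Record that the user has clicked Server; uploads are closed from now on."""
        self.server_started = True

    def try_upload(self, filename):
        """Return True if the file is accepted, False if the policy rejects it."""
        if self.server_started or len(self.uploaded) >= self.MAX_FILES:
            return False
        self.uploaded.append(filename)
        return True
```

Keeping both rules in a single check means a rejected upload never changes the session state.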

Download returns a listing of the user’s directory contents. Files in this 
directory can be viewed or downloaded from the listing, but since users’ 
directories are password protected, no other directories can be viewed 
or entered. Download is available throughout the user’s session. Users 
should look here frequently during the session to check on PELLPACK 
generated problem, solution and trace files. The Server option invokes 
the password protected PELLPACK software. After the password is 
entered and verified, the top level window of the PELLPACK system 
appears in the bottom frame of the browser window as shown in Figure 2. 
A collection of sample problems has been placed in the user’s directory, 
so users can load an example into the PELLPACK session or begin their
own problem definition. The session in Figure 3 is in the bottom frame 
of the WebPDELab server. The options of the control panel are still 
available in the top frame, but only the Download and Logout options 

Figure 2 The PELLPACK top level window appears in the bottom frame of the
WebPDELab browser window. It is ready for user interaction.



are enabled. The Upload and Server options remain disabled while the
PELLPACK software is running in the bottom frame. 

During the PELLPACK session, WebPDELab passes the display
of the remotely executing PELLPACK environment to the user’s browser
window. The graphical interface displayed on the user’s screen belongs 
to PELLPACK and is not described in this paper. When users click on 
Logout, the PELLPACK session is terminated and the user’s directory 
is removed. WebPDELab traces all user activities from the start of the 
server session until its termination. Users’ files are secure from other
users, but WebPDELab “looks at” the contents of every file uploaded to
WebPDELab or created by the user from within the PELLPACK system. 
Security mechanisms for the WebPDELab server and host cluster are 
discussed in Section 2.5. 

2.4. WEBPDELAB IMPLEMENTATION 

VNC is a remote display system that allows users to view a comput- 
ing “desktop” environment from anywhere on the Internet using a wide 
variety of machine architectures. VNC consists of a server that runs 
the applications and generates the display, a viewer that draws the dis- 
play on the client screen, and a TCP/IP connection between them. The 
server is started on the machine where the desktop resides, after which 

Figure 3 A PELLPACK session running inside the WebPDELab browser window.



any number of viewers can then be started and connected to the server. 
This allows the client user to access the applications, data, and entire 
desktop environment provided by the server. The viewer is a small, 
sharable, platform-independent, and stateless system that runs on the 
client machine. In the WebPDELab implementation, a new VNC UNIX 
server is started for each user who accesses the WebPDELab web server
from a Java-enabled browser (see Figure 4). The VNC Java viewer is 
started from the user’s browser, allowing the user to display and in- 
teract with the PELLPACK environment, which consists of X windows 
programs and libraries compiled and running on the i86pc SunOS 5.6 
host machines. Within this framework, any user worldwide who is con- 
nected to the Internet and has access to a Java-capable browser can run 
WebPDELab. 

The WebPDELab manager is the collection of CGI scripts (Common 
Gateway Interface protocol for browser to server communication) that 
controls all user activity once the PDELab Server at the WebPDELab 
web site is selected. When a user accesses the server, the manager col- 
lects information on all currently running VNC servers from the host 
machines. The manager then asks the potential user to enter registration
information, including a valid e-mail address. After the e-mail
address is validated, a unique user id is generated for the new user, and
a log file is set up to track registration information, user access/exit
times, and user activities while running the PELLPACK software. The 
host machine with the lightest traffic is selected by the manager to run 
the VNC server, and subsequently the PELLPACK software. A protec- 
tive client server application is used to launch the VNC server, so that 
users are never logged in to any machine in the host cluster. The VNC 
server startup invokes the PELLPACK system, and the manager creates 
the user directory, sends the control panel to the user, and monitors 
the user’s interaction with the control panel options (Upload, Download,
Server and Logout).
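
The manager's host selection and id generation can be sketched as follows. This is a minimal sketch under stated assumptions: "lightest traffic" is approximated by the number of VNC servers already running on each host, and the id format is invented for illustration. The function names are hypothetical and do not come from the WebPDELab scripts.

```python
import itertools

_counter = itertools.count(1)  # hypothetical source of unique session numbers


def pick_host(sessions_per_host):
    """Return the host currently running the fewest VNC servers.

    Assumption: "lightest traffic" is approximated by the session count.
    """
    return min(sessions_per_host, key=sessions_per_host.get)


def new_user_id():
    """Generate a unique id; the real id format is not specified in the text."""
    return f"u{next(_counter):06d}"


def open_session(sessions_per_host):
    """Choose a host, record the new session on it, and return (user id, host)."""
    host = pick_host(sessions_per_host)
    sessions_per_host[host] += 1
    return new_user_id(), host
```

A call such as `open_session({"hostA": 2, "hostB": 0})` would place the new session on `hostB` and increment its count.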

Upload is implemented using copyrighted public domain code at 
http://stein.cshl.org/WWW/software/CGI (Lincoln Stein, 1998). The 
code has been modified to operate with the WebPDELab/VNC user di- 
rectory privacy restrictions. Download is implemented as a standard link 
to the user’s file space, but additional password security protects a user’s 
assigned directory from all other users on the Internet. Server connects 
the VNC client user to the VNC server that has been instantiated for 
the caller on the host selected for that server.

After control has passed to the VNC client, the manager waits for a 
VNC disconnect or a click on Logout. When signaled to start exit pro- 
cessing, the manager saves the trace of user activities to the log database, 
kills the VNC server, and removes the user’s directory. The manager also 
checks all executing VNC servers periodically for sessions running longer 
than 24 hours, and these sessions are terminated. When the manager has 
finished exit processing, control is returned to the WebPDELab home 
page. 
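
The periodic check for over-long sessions and the exit steps listed above can be sketched as below. This is an illustrative sketch only: the data structures are assumptions, and the steps that would kill the VNC server and delete the user's directory are represented by log entries rather than real system calls.

```python
from datetime import datetime, timedelta

MAX_SESSION_AGE = timedelta(hours=24)  # limit stated in the text


def expired_sessions(start_times, now):
    """Return the ids of sessions that have been running longer than 24 hours."""
    return [uid for uid, started in start_times.items()
            if now - started > MAX_SESSION_AGE]


def exit_processing(uid, start_times, log):
    """Mirror the described exit steps: save the trace, stop the server, remove files."""
    log.append(f"{uid}: activity trace saved")
    log.append(f"{uid}: VNC server stopped")      # placeholder for killing the process
    log.append(f"{uid}: user directory removed")  # placeholder for deleting the files
    del start_times[uid]
```

Collecting the expired ids into a list first avoids modifying the session table while it is being scanned.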



2.5. WEBPDELAB SECURITY ISSUES 

When a user logs into the WebPDELab server, a CGI script is exe- 
cuted which generates the unique user id and requests one of the cluster 
host machines to invoke a VNC X-server. The WebPDELab CGI scripts 
reside on an isolated machine dedicated to serving CGI requests. This 
machine has no NFS-mounted disks, therefore an attacker attempting 
to take advantage of vulnerable CGI scripts is locked into the cgi-bin 
directory and cannot gain access to any other machines or disks. All 
parameters passed to WebPDELab CGI scripts are scanned to ensure 
they contain precisely the expected values (argument number, length 
and contents), else the request is terminated. The cluster machines lis- 

Figure 4 Implementation of the WebPDELab server.



ten on a fixed port for startup requests from the CGI machine. If an 
attempt is made to connect to this port that does not originate from 
the CGI host, the connection is immediately terminated. All cluster 
machines run a daemon that listens for socket connections on a spec- 
ified port and spawns a child process to serve the request, while the 
parent continues to listen for other connections, so that requests can be 
served simultaneously. A client program is invoked by the CGI script 
to contact the cluster machine, and requests that a new VNC X-server 
be launched. The client may only specify the VNC X-server startup pa- 
rameters, since the launching of the VNC X-server binary is hard-coded 
in the configuration file of the daemon serving requests originating from 
the CGI host. The VNC server itself is protected by a challenge-response 
password scheme. 
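
The two checks described above (exact validation of argument number, length and contents, and rejection of connections that do not originate from the CGI host) can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the argument specification, the patterns and the host address are hypothetical, not the actual WebPDELab configuration.

```python
import re

# Hypothetical specification: one (max_length, pattern) pair per expected argument.
EXPECTED = [
    (16, re.compile(r"[A-Za-z0-9]+")),  # e.g. a user id
    (4, re.compile(r"[0-9]+")),         # e.g. a display number
]

CGI_HOST = "10.0.0.5"  # hypothetical address of the dedicated CGI machine


def params_ok(args, expected=EXPECTED):
    """Accept only the exact number of arguments, each within its length
    limit and matching its pattern in full."""
    if len(args) != len(expected):
        return False
    for value, (max_len, pattern) in zip(args, expected):
        if len(value) > max_len or pattern.fullmatch(value) is None:
            return False
    return True


def accept_startup_request(peer_address, args):
    """Drop connections that do not come from the CGI host, then validate."""
    if peer_address != CGI_HOST:
        return False
    return params_ok(args)
```

Matching each value in full against an allow-list pattern, rather than searching for known bad substrings, is the design choice that corresponds to "precisely the expected values".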

The cluster machines run the VNC X-server under a dedicated
account whose root directory is the account’s home directory (using the
Unix maintenance chroot command). All the required binaries are
located in the chroot directory. If a user discovers vulnerabilities in one of
the cluster machines, the user is locked into the home directory of the 
account, and is unable to cause harm to other accounts or disks. In or- 
der to protect the machine from unauthorized Fortran code inserted by 
a user into the PELLPACK .e file, specialized filters have been built into 
the original PELLPACK system. The PELLPACK language processor 
initially restricts the location of Fortran code to specialized segments 
within the PELLPACK problem definition file; these segments are now
re-parsed by filters that check inserted Fortran statements for
unauthorized code.

Every user is provided with a unique directory for uploading and 
downloading files, thus facilitating the option of saving and retrieving 
material. The CGI script creates this directory after the registration 
information is entered and validated. Users’ directories are password
protected, securing each user from all other users. Every user file,
however, is opened and checked by WebPDELab for legal content as it is
uploaded or saved by the user from inside PELLPACK. 

2.6. WEBPDELAB FEATURES AND ISSUES 

The significant benefits that can be obtained from the implementation 
of the WebPDELab server are: 

■ Generality. Any machine connected to the Internet can use the 
PELLPACK environment without concerns about language or ma- 
chine compatibility. 

■ Interaction. Users can specify the PDE with normal interaction 
speeds for the client machine, since data entry is done locally. 
The amount of code exported to support the user interface is sub- 
stantial (several megabytes), but it is only a small fraction of the 
PELLPACK system. If the user has no graphics capability, then 
the text based interface tools must be used; these are less conve- 
nient but still practical to use. 

As the PDE problem is being specified, information is sent to the 
server. The server might request additional information but once 
the problem is completely specified, it is solved on the server’s host 
machines. After the PDE is solved, the user can either view output 
generated by the server or request that the solution (normally a 
large data set) be returned for local use. 

■ Access to high performance computers. Any user can access ma- 
chines with sufficient power to solve the PDE problem. Even if the 
solution is too large to be sent to the user (or if there are no local 
visualization tools), the solution can be explored over the Internet. 

■ No code portability problems. Users do not need to have the code 
in the local machine language, since the software infrastructure 
operates only on the server’s host machines. 

There are several concerns and technical issues involved in the service 
provided by WebPDELab, which we now discuss:

■ Performance of the user interface. There is a clear trade-off in
user interface performance between exporting code to the user’s
machine and executing code on the server. Our initial prototype
showed that communicating each mouse click back to the server for
processing provides unsatisfactory interactive performance due to
network delays. Our analysis indicates that almost all of the
interaction can be run locally by exporting a moderate amount of code.
The user interface does use some tools that are both time consuming
to execute and too large to export. Examples are Maxima
(used to transform mathematical equations) and domain processors
(used to create meshes or grids in geometric domains). These
tools usually require pauses in response even without a network,
and the added delay due to networks is unlikely to be significant.

■ Security for the server. While we control the material received
from a user, the server is clearly subject to attack. We place the 
server on a separate subnet and access licensed software through a 
gateway. Since we know exactly what is to be sent via an RPC, it 
is possible to protect this licensed software. Even if a user succeeds 
in becoming “root”, access to other machines is not possible. Of 
course, network file systems and similar tools are not used. Our 
process of “registering” users when we give them accounts provides 
us with a chance to screen users before providing them access to 
WebPDELab. 

■ Security for the user. This requires each user to be completely
isolated from all others. Each user on the server runs in a virtual
file system using a login with no access privileges. Thus, each user
appears to have the entire machine, and the protection mechanisms
between machines protect users from one another. This approach
provides security at the cost of using much more memory than 
normally necessary. 

■ Software ownership and fair use. We prevent the copying of software
by placing, if necessary, source code on another machine or
another network and using secure RPC.

■ Payment for computing services. The WebPDELab server is provided
free to users, as is time on the associated servers used for
security purposes. We do not foresee a need to charge users for
time on these machines. If large numbers of users contend for service
then they will be queued, so the cost of the servers is clearly
limited.

3. HETEROGENEOUS DISTRIBUTED
COMPUTING VIA LAN/WAN

High-performance networks have already become an integral part of 
today’s computing environments. This new capability allows applica- 
tions to evolve from static entities located and running on specific hosts 
to applications that are spread across Internet or Intranet based
networks. In a network-based computing environment or computational
grid, the fundamental research problem can be viewed as the
development of an infrastructure that supports the construction of
applications by dynamically composing application-specific components,
mostly “legacy”, that are identified and linked together at run-time
in a secure and robust way. In this paper we utilize and evaluate
various software technologies and architectures to solve the above
problem in the context of a multi-physics application [11]. We expect
the experience gained by this study to allow other application
researchers to realize the capabilities and limitations of existing
technologies in their attempt to take advantage of a network-based
environment in their application domain. In [2] several back-end
systems are reviewed to implement the network-based scenario for
different types of applications. In this
study we experiment with the agent based Grasshopper system and with 
the Java RMI middleware. In this section, we give a short overview of 
Grasshopper and describe the design of the gas turbine engine model 
PSE (GasTurbnLab) and its two implementations. 

3.1. GRASSHOPPER AGENT SYSTEM 

Grasshopper [9] is an agent development platform that supports dis- 
tributed agent-based applications. The platform provides a base for 
communications services, mobile computing power and dynamic infor- 
mation retrieval. It is essentially a mobile agent platform that is built on 
top of a distributed processing environment, integrating the traditional 
client-server paradigm and mobile agent technology. The primary fea- 
ture of the Grasshopper platform is its location independent computing, 
driven by the ability to move agents between different systems. It is a
powerful tool that facilitates creating agents, transparently locating
them and controlling their execution. These agent-based applications
are interoperable with other agent systems that are MASIF (Mobile 
Agent System Interoperability Facilities) compliant. The Grasshopper 
Distributed Agent Environment is composed of regions, agencies and 
agents. At the top of the hierarchy is the region that manages the 
distributed components in the Grasshopper environment. Agencies and 
agents are associated with a particular region. Each region has a registry
that maintains information about all components associated with it. An
agency is the runtime environment for mobile and stationary agents,
providing the functionality to support agent execution. The agency is
responsible for a number of services including: (1) communication
services for all remote interactions that take place between Grasshopper
components, including their movement and transport (interactions can be
performed via CORBA, IIOP, RMI or plain sockets), (2) registration
services to track all currently hosted agents, (3) management services
that monitor external control of agents, (4) transport services for
migration of agents from one agency to another, (5) security services
for protecting remote interactions and agency resources, and (6)
persistence services for enabling the storage of agents for possible
recovery. Agents are computer programs characterized by a set of
attributes. They can be mobile or stationary. Mobile agents move from
one location to another within a region to take advantage of local
interactions, and thus are capable of reducing network loads by
migrating. Stationary agents are associated with a particular location
only and are incapable of migration.

3.2. THE GAS TURBINE ENGINE 
MULTI-PHYSICS MODEL 

The gas turbine engine is an engineering triumph. It has more than 
1,300 parts with rotational speeds to 16,000 rpm for axial and 50,000 rpm 
for radial flow components. For aircraft applications, it operates with 
maneuver loads of up to 10g, with flow path pressures and temperatures
to 40 atmospheres and 1400 F. The important physical phenomena take
place on scales from 10-1000 microns to meters. A complete and accurate
simulation of an entire engine is enormously demanding; it is unlikely
that the required computing power, simulation technology or software
systems will be available in the next decade. The primary goal of our
research in this area is to advance the state of the art in very complex
scientific simulations and to validate the simulation results. Specifically,
we consider simulating the compressor-combustor coupling in a gas turbine
engine. We view the application as a set of collaborating physical
models. The hardware infrastructure assumed for these simulations consists
of a computational grid involving an SP-2, a 128-PC cluster running
Solaris, and an SGI Origin 2000 with 32 CPUs. Next, we consider the
utilization of the Grasshopper agent system to implement GasTurbnLab.

Figure 5 The grid computing scenario for the Grasshopper implementation of the
GasTurbnLab simulation.



3.3. THE GRASSHOPPER IMPLEMENTATION OF
THE GAS TURBINE ENGINE MODEL

The Grasshopper environment supports the traditional client-server 
structure. GasTurbnLab servers are Grasshopper agents which provide 
computational service and data visualization. The GasTurbnLab client 
agent controls the entire simulation process by launching/terminating 
the server agents, managing their interactions and facilitating the
cross-network asynchronous communication of computational and control
data. GasTurbnLab is implemented using four classes of agents: the
simulation control agent (SCA), the visualizer agent (VA), the legacy
code agents (LCA) and the mediator agents (MA). The SCA is the 
client which requests and manages the services of the remaining classes 
of agents. The VA is a server which receives solution data (velocity, 
speed, density, energy, pressure) for one or more engine parts, and ren- 
ders it graphically using the Iris Explorer data visualization system [10]. 
The LCAs are the computational agents which simulate the engine; each 
agent encapsulates an established legacy code targeted for a specific en- 
gine part. The GasTurbnLab model requires the LCAs to communicate 
their boundary data (via the SCA) to one or more MAs. The MAs en- 
capsulate the mediation codes which are responsible for adjusting and 
resolving interfaces (represented by the boundary data) between neigh- 
boring engine parts, each of which is simulated by one LCA. The Gas- 
TurbnLab SCA can handle any number of engine parts, that is, any 
number and type of communicating legacy code and mediator agents.

The initial challenge was to create a template procedure for embedding
legacy Fortran codes into the server agent structure. The resulting
LCAs could then exercise control over the Fortran code and enable data 
flow by starting up the legacy code, pausing after each iteration to com- 
municate the required boundary data to the SCA, receiving the mediated 
boundary data in return and continuing with the next iteration. Two 
legacy codes have been encapsulated as LCAs: Ale3d, an advanced
CFD code for simulating turbines [7] and Kiva, an advanced combustion
simulation code [1]. The template procedure requires the legacy code,
which generally starts out as a Fortran executable, to be transformed
into a C-wrapped legacy code library as follows: (1) change the Fortran
main to a subroutine, with command line arguments passed as parameters,
(2) start the Fortran subroutine as a thread from the C wrapper
routine, (3) define C structures to hold the data representing boundary
information, (4) write a C data transfer routine to copy boundary data
back and forth between the C and Fortran data structures, (5) insert a
call to the C data transfer routine in the Fortran iteration loop, and (6)
insert control code in the C wrapper to sleep/wake the C data transfer
routine, thus effectively controlling the pause and restart of the Fortran
iterations. The changes to the Fortran code are minimal. The C wrapper
is defined with a JNI [14] interface to the Java agent code, and the C
boundary data is passed up through the JNI interface parameters to the 
Java agent. The Fortran code and two C routines are compiled together 
as a legacy library which is loaded into the Java LCA server when it 
is instantiated. The Java legacy agent starts the C wrapper and waits 
for the JNI object containing the boundary data. When the object is 
received, the agent serializes it and communicates it to the SCA. The 
SCA returns a mediated data object to the agent, which copies it into a 
JNI object and passes it to the C wrapper. The SCA is also responsible 
for passing the legacy agent a termination signal. 
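
The pause-and-restart control in steps (2), (5) and (6) can be illustrated with a small threading sketch. It is written in Python rather than Fortran, C and JNI, so it shows only the control pattern: the solver runs as a thread, and the transfer call inserted in its iteration loop blocks until mediated data arrives. The names Wrapper, legacy_solver and run, and the toy arithmetic, are hypothetical.

```python
import queue
import threading

STOP = object()  # termination signal sent by the controller


def legacy_solver(exchange, iterations):
    """Stand-in for the Fortran iteration loop (step 5 inserts the call)."""
    boundary = 1.0
    for _ in range(iterations):
        boundary = boundary * 0.5      # one "iteration" of the solver
        boundary = exchange(boundary)  # pause here until data returns
        if boundary is STOP:
            return


class Wrapper:
    """Plays the role of the C wrapper: starts the solver as a thread and
    blocks it inside `_exchange` until mediated data arrives."""

    def __init__(self, iterations):
        self.outgoing = queue.Queue()
        self.incoming = queue.Queue()
        self.thread = threading.Thread(
            target=legacy_solver, args=(self._exchange, iterations))

    def _exchange(self, boundary):
        self.outgoing.put(boundary)  # hand boundary data upward
        return self.incoming.get()   # sleep until woken with a reply

    def start(self):
        self.thread.start()

    def next_boundary(self):
        return self.outgoing.get()

    def resume(self, mediated):
        self.incoming.put(mediated)


def run(iterations, mediate):
    """Drive the solver the way the agent does: receive, mediate, resume."""
    wrapper = Wrapper(iterations)
    wrapper.start()
    seen = []
    for _ in range(iterations):
        data = wrapper.next_boundary()
        seen.append(data)
        wrapper.resume(mediate(data))
    wrapper.thread.join()
    return seen
```

Blocking queues are used here in place of explicit sleep/wake calls; the effect on the solver thread is the same, since it cannot start the next iteration before the mediated boundary data has been delivered.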

Mediator agents are customized Java programs that are easily encap- 
sulated as Java agents. The interface relaxation code is heavily depen- 
dent on the legacy codes and data types which are being mediated. The 
mediator code is wrapped by agent code which handles the boundary 
data object communication between the MA and the SCA. 

The GasTurbnLab project has successfully implemented an AleAgent 
(LCA) for simulating both rotor and stator engine parts (an engine part 
is also called a domain), a KivaAgent (LCA) for simulating a combus- 
tor, and a MA to mediate boundary data. The MA supports mediation 
between two Ale3d domains (stator and rotor) and between Ale3d and 
Kiva domains (stator and combustor). The first prototype was a
two-domain Ale3d stator-rotor simulation.

Figure 6 Implementation of the Legacy Code Agent.

We assume a running Grasshopper region with 3 or 4 registered agencies
on machines belonging to the
GasTurbnLab computational grid (see Figure 5). Two AleAgents, an MA
and a VA are loaded into (any combination of) the agencies and auto- 
matically registered with the region. The SCA is loaded into an agency 
and started as a stator-rotor controller. The SCA first looks up the 
registry to determine the availability of the necessary servers. It then 
supplies the two AleAgents with grid and parameter information, and 
the server agents commence execution. After the encapsulated legacy 
code finishes one cycle of computation, the boundary data is copied up- 
ward from the Fortran code to the C-wrapper to the JNI object to the 
Java agent. The SCA receives this data from the AleAgents, then filters 
and serializes the data for the MA. The MA is called with this data 
and, after running the interface computations, it returns the mediated 
data to the SCA. The SCA restarts the AleAgents with the mediated 
data. This process continues until the termination signal is given. The 
VA is launched with solution data from both domains, and users can 
interactively choose which domains and solution data to display. This 
prototype simulation has run thousands of iterations and the solution 
output data has been verified. 
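
The control flow of the stator-rotor prototype can be summarized in a short sketch. It is a toy model in Python, not the GasTurbnLab code: ToyLegacyAgent and ToyMediator are hypothetical stand-ins for the AleAgents and the MA, the interface data is reduced to a single number, and the tolerance-based stopping test is an assumption, since the text states only that a termination signal is given.

```python
class ToyLegacyAgent:
    """Stand-in for an AleAgent: one call to `cycle` is one legacy-code cycle."""

    def __init__(self):
        self.interface_value = None

    def configure(self, initial_value):
        self.interface_value = initial_value

    def cycle(self, mediated=None):
        if mediated is not None:
            self.interface_value = mediated
        return self.interface_value  # boundary data handed to the SCA


class ToyMediator:
    """Stand-in for the MA: relaxes the two interface values toward each other."""

    def mediate(self, left, right):
        average = 0.5 * (left + right)
        return average, average


def simulation_control(registry, initial_values, max_cycles, tolerance):
    # Look up the servers; refuse to start if any is missing.
    needed = ("stator", "rotor", "mediator")
    missing = [name for name in needed if name not in registry]
    if missing:
        raise LookupError(f"agents not registered: {missing}")
    stator, rotor, mediator = (registry[name] for name in needed)

    # Supply the legacy agents with their parameters and run the first cycle.
    stator.configure(initial_values[0])
    rotor.configure(initial_values[1])
    left, right = stator.cycle(), rotor.cycle()

    for cycle in range(1, max_cycles + 1):
        if abs(left - right) <= tolerance:  # assumed termination test
            return cycle, left, right
        new_left, new_right = mediator.mediate(left, right)
        # Restart both agents with the mediated data.
        left, right = stator.cycle(new_left), rotor.cycle(new_right)
    return max_cycles, left, right
```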

3.4. THE JAVA RMI IMPLEMENTATION OF 
THE GAS TURBINE ENGINE MODEL 

The presence of the Grasshopper platform simplifies several aspects 
of the implementation, such as the asynchronous communication and 
the agent management. However, the current implementation of the gas 
turbine engine simulator does not use Grasshopper for agent migration 
or persistent storage. Thus, we have considered an implementation that 
uses only Java RMI. A direct Java implementation gives us better insight 
and control over our system, at the cost of having to program ourselves
part of the functionality provided by Grasshopper, mainly asynchronous
remote method calls.

Our Java RMI implementation of the gas turbine model was based 
upon the Grasshopper design. Thus, the software interfaces of the model 
modules are either the same as those in the Grasshopper implementa- 
tion, or (in the worst case) equivalent to them. The control flow is 
almost identical. For keeping track of the available agents we created a 
simple custom registry that itself uses the standard RMI registry pro- 
vided by the Java SDK. In the two-domain problem described for the 
Grasshopper implementation, we first start our custom registry. All the 
agents (AleAgents, MAs, VA) are registered in the same custom registry. 
The SCA looks up the custom registry for available agents. If enough 
agents are available, the SCA supplies the AleAgents with parameters, 
and the first cycle of computation begins. Hereafter the execution and 
data copy chain between the various agents and wrappers continue in 
the way described in Section 3.3 until the computations are complete. 
This implementation has shown that pure Java technology can be easily 
utilized to implement distributed computations. In the near future we 
will compare the efficiency of the two implementations. 
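
Two pieces of functionality mentioned above, a registry that reports whether enough agents are available and asynchronous method calls, can be sketched as follows. The sketch is in Python and runs in a single process, so it illustrates the bookkeeping only; it does not use Java RMI, and the class and function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class AgentRegistry:
    """Hypothetical stand-in for the custom registry: agents grouped by kind."""

    def __init__(self):
        self._agents = defaultdict(list)

    def register(self, kind, agent):
        self._agents[kind].append(agent)

    def lookup(self, kind, count=1):
        """Return `count` agents of a kind, or None if too few are registered."""
        available = self._agents.get(kind, [])
        return available[:count] if len(available) >= count else None


class EchoAgent:
    """Toy agent used only to exercise the registry."""

    def __init__(self, name):
        self.name = name

    def cycle(self, step):
        return f"{self.name}:{step}"


def call_async(executor, agents, method, *args):
    """Start the same method on every agent without waiting for any of them."""
    return [executor.submit(getattr(agent, method), *args) for agent in agents]
```

The caller collects the results later through the returned futures, which is the behavior an asynchronous remote call has to provide on top of a blocking call mechanism.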

4. DISTRIBUTED SCIENTIFIC LIBRARIES 
AND PSES 

In this section we demonstrate two middleware approaches to managing
the software and computational resources with respect to data location,
application code, and hardware. The approaches are based on the
so-called remote computing or traditional client-server approach, such as
that provided by CORBA. The code of interest is accessed as a service
provided by a computational server. The user’s data is sent to the server, 
the requested operations are performed, and the result is sent back to 
the initiator of the computation. In this computational paradigm, there 
is generally a tight binding between the software and the computational 
resources on which the software runs. We use the Java RMI middleware 
to implement remote computing with the ITPACK library, and expand 
the PELLPACK PSE utilizing the NetSolve middleware and its servers. 

4.1. A WEB-BASED ITPACK LIBRARY 
SERVER 

In this section we describe the implementation of the ITPACK library
as a remote computational server utilizing the Java RMI technology.

Figure 7 Java RMI based ITPACK wrapper interface. The user selects the module
and its parameters, inputs the matrix in a sparse storage scheme and inputs the
right-hand-side. The system returns the solution.

The ITPACK library implements a number of iterative methods for solving
finite element and difference equations resulting mainly from the
discretization of elliptic boundary value problems. Java’s RMI capability
allows programmers to invoke object methods residing on a remote
machine. More importantly, it allows the serialization of objects, thus 
enabling the client and server to reference remote objects. The Fortran
code of the ITPACK library routines was wrapped with a simple C function
that accepted relevant parameters in order to invoke the appropriate
ITPACK routine. This C wrapper was compiled into a library, which
was loaded by the top-level Java RMI server method. The Java client
and server programs shared a common understanding of the object to be
referenced. Such an object was initialized by the client program and
then passed over to the remote RMI method, which then called the C 
ITPACK wrapper routine. After the execution of the RMI method, the 
variables of the client program object contain the computed solution. In 
addition, we implemented a web-based interface to the ITPACK server, 
shown in Figure 7. It consists of a simple HTML form that posts the val- 
ues of the parameters entered by a user to the Java RMI client program, 
and outputs its result in the web browser. 
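
The round trip described above (a shared problem object is filled in by the client, serialized, solved on the server and returned with the solution in place) can be sketched in Python. This is an illustration under stated assumptions rather than the actual server: JSON stands in for RMI object serialization, a plain Jacobi iteration stands in for the wrapped ITPACK routine, and the LinearProblem field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LinearProblem:
    """Object shared by client and server (field names are hypothetical)."""
    n: int
    rows: list   # coordinate sparse storage: row indices
    cols: list   # column indices
    vals: list   # nonzero values
    rhs: list
    iterations: int = 50
    solution: list = field(default_factory=list)


def jacobi_solve(problem):
    """Server-side routine: plain Jacobi iteration, standing in for the
    wrapped ITPACK call. Assumes a nonzero diagonal."""
    diag = [0.0] * problem.n
    for i, j, v in zip(problem.rows, problem.cols, problem.vals):
        if i == j:
            diag[i] = v
    x = [0.0] * problem.n
    for _ in range(problem.iterations):
        residual = list(problem.rhs)
        for i, j, v in zip(problem.rows, problem.cols, problem.vals):
            if i != j:
                residual[i] -= v * x[j]
        x = [residual[i] / diag[i] for i in range(problem.n)]
    problem.solution = x
    return problem


def remote_solve(problem):
    """Mimic the remote call: serialize, solve on the 'server', send back."""
    wire = json.dumps(asdict(problem))                   # client to server
    on_server = LinearProblem(**json.loads(wire))
    reply = json.dumps(asdict(jacobi_solve(on_server)))  # server to client
    return LinearProblem(**json.loads(reply))
```

For example, the 2 by 2 system with rows (4, 1) and (1, 3) and right-hand side (1, 2) is passed as rows=[0, 0, 1, 1], cols=[0, 1, 0, 1], vals=[4.0, 1.0, 1.0, 3.0]; the returned object carries an approximation to the exact solution (1/11, 7/11).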






4.2. A NETWORK-BASED PELLPACK PSE 

In this section we consider a scenario for PELLPACK where either 
code is shipped to the requesting location (executing in the requester’s 
environment using “local” data) or the code of interest is accessed as 
a service provided by a computational server. In the second paradigm, 
the intermediate PELLPACK data is sent to the server, the requested 
operations are performed, and the result is sent back to PELLPACK. 

The first computational paradigm could be implemented by a network 
file system to support the sharing of well-defined software libraries with 
PELLPACK across LAN and WAN computational environments. Many 
different versions of network file systems exist, e.g. Sun’s Network File 
System (Sun NFS), the Andrew File System (AFS) [15] and the Coda 
file system [3]. The Sun NFS is primarily designed for local area net- 
works while AFS and Coda were developed with a wide area network 
application in mind. As a consequence, different approaches are em- 
ployed for caching, replication, and recovery, which affect performance 
(especially over slow links) as well as data integrity. The most crucial
difference, however, lies in the level of security offered. Specifically, AFS
and Coda employ rigorous authentication protocols, thereby ensuring 
that the server will reject “unauthorized” requests. On the other hand, 
in the off-the-shelf Sun NFS, the client authenticates the user while the 
server accepts and executes incoming client requests. It is thus
(theoretically) possible to construct “man in the middle” attacks, circumventing
the proper client authentication procedure. 

When NFS or AFS exported shares are mounted locally, the end-user 
can deal with them as if they are contained in another directory present 
on their local disk. Since PELLPACK makes use of various libraries 
to perform its computations, these libraries can be stored at various 
machines on a WAN or LAN and exported for use by local machines. The 
main advantage of such an approach is that the PELLPACK program 
does not recognize the remote existence of these libraries, and makes use 
of them as if they were available locally. 

The second computational paradigm is currently implemented by utilizing
the NetSolve Project. NetSolve is an RPC based client/agent/server
system that supports remote access to both hardware and software
computational resources distributed across a network. NetSolve responds to
a user’s request for service, solves the problem and returns the solution
to the user. NetSolve provides secure yet flexible mechanisms for
cooperation and control between resources, processes, data and users. 
NetSolve has implemented interfaces in Fortran, C and Matlab for a 
large number of computational software components.

Figure 8 A networked computing view of the PELLPACK PSE supported by a
distributed file system and/or NetSolve.

The open architecture of the WebPDELab PSE supports the easy 
integration of software components used in the PDE problem solving 
process. The WebPDELab framework that has been implemented for 
the NetSolve interface allows WebPDELab users to specify any of the 
computational components available through NetSolve (i.e., components 
which are related to the PDE solution process) from within WebPDE- 
Lab’s graphical environment. This framework opens up the entire Net- 
Solve environment for use by the WebPDELab PSE. It provides: 

1 The running server and agent processes required by NetSolve.
NetSolve requires users to download, compile, link and configure
the system that comprises NetSolve’s agent/server control,
communication and hardware/software interfaces. This software is 
installed and running on WebPDELab’s supporting i86pc machine 
cluster, and allows WebPDELab users to access the NetSolve sys- 
tem without downloading and installing the NetSolve system on 
their own machines. 

2 A library of interface routines handling the conversion of WebPDELab
data structures (matrices, parameters, etc.) to NetSolve problem
definition data structures and back. The NetSolve system
requires users to write their own Fortran, C or Matlab code to define
the problem and set up the data structures in preparation for
a call to the NetSolve software/hardware resources. This process 
has been automated in WebPDELab by allowing users to select al- 
gorithms (such as linear solvers) and their parameters graphically, 
and then providing data conversion and NetSolve communication 
services without any intervention by the user. 

3 A call to NetSolve requesting service for the user-specified algorithm,
with the call setup and problem definition parameters taken
care of automatically by the WebPDELab server.

The WebPDELab-NetSolve integration framework supports the in- 
clusion of new NetSolve algorithms as a simple two-step process which 
is handled by the WebPDELab support group. First, a simple con- 
version routine is written to transform WebPDELab data structures to 
NetSolve structures and added to the WebPDELab interface library. 
Second, the name of the new algorithm and a list of its parameters are 
inserted in WebPDELab’s language processor and in the graphical al- 
gorithm specification module. Since only NetSolve software related to 
the PDE problem solving process makes sense in the WebPDELab envi- 
ronment, these are the NetSolve algorithms that have been integrated. 
The WebPDELab-NetSolve integration has been tested for a collection 
of sparse linear solvers from the numerical software resources provided 
by NetSolve. 
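
The two-step process can be illustrated with a small sketch. It is written in Python and makes no use of the NetSolve interfaces; the algorithm name, the parameter names, the compressed-row target format and the build_request helper are all hypothetical, chosen only to show a conversion routine (step one) and an algorithm entry with its parameter list (step two).

```python
# Step 1: a conversion routine added to the interface library.
def to_compressed_rows(n, entries):
    """Convert {(row, col): value} entries into compressed sparse row arrays."""
    row_ptr, col_idx, values = [0], [], []
    for i in range(n):
        row = sorted((j, v) for (r, j), v in entries.items() if r == i)
        for j, v in row:
            col_idx.append(j)
            values.append(v)
        row_ptr.append(len(values))
    return row_ptr, col_idx, values


# Step 2: the algorithm name and its parameter list, as they would be
# entered in the language processor and the graphical specification module.
ALGORITHMS = {
    "example_sparse_solver": {  # hypothetical algorithm name
        "parameters": ["tolerance", "max_iterations"],
        "convert": to_compressed_rows,
    },
}


def build_request(name, n, entries, rhs, **params):
    """Assemble the data for a remote call; unknown parameters are rejected
    and an unknown algorithm name raises KeyError."""
    spec = ALGORITHMS[name]
    unknown = set(params) - set(spec["parameters"])
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"algorithm": name, "matrix": spec["convert"](n, entries),
            "rhs": list(rhs), "parameters": params}
```

Keeping the conversion routine and the parameter list in one table entry means that adding an algorithm touches only that table, which mirrors the claim that integration is a simple two-step process.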

5. CONCLUSION 

“Many a mickle makes a muckle” observes a recent article in
The Economist magazine. It predicts that harnessing the spare computing
capacity scattered around a zillion desktops through the Internet will
soon be a reality and worth real money. Already there are several
commercial and research projects trying to exploit this “something-for-nothing”
idea. Distributed computing over the Internet is known by many terms,
such as network computing, meta-computing, seamless scalable computing,
network-based computing, net-centric computing, grid computing
and internet computing. The realization of the network computing
paradigm depends on several developments that include middleware for
heterogeneous, fault tolerant and secure computing in a computational
grid involving thousands of machines, partitioning and scheduling of
large computational problems, and intelligent management of all these
resources. In this paper, we argue that network computing is feasible
with existing software and algorithmic technologies, even for computations
based on “legacy” codes. We have shown that “legacy” scientific
PSEs and libraries can be used in a network computing setting. We have
experimented with partitioning, mapping, and execution of a large-scale
multidisciplinary application simulated by “legacy” software across a
network of resources utilizing an agent based middleware (Grasshopper)
and pure Java technologies. This experience confirms the feasibility of
the network computing paradigm and its potential.

DISCUSSION

Speaker: Elias Houstis 

Vladimir Getov : The second project as presented in your talk makes
extensive use of agents. Is there a specific reason for that? 

Elias Houstis : The agent computational paradigm is a better intellectual
fit for the MPSE concept and the networked mathematical models
for simulating multidisciplinary applications like the gas turbine engine.
Moreover, it provides a higher level middleware for implementing
heterogeneous distributed computations. We have a Java RMI implementation
of the GasTurbnLab that works equally well for a small number of
components. It remains to be seen whether the two implementations will
scale efficiently.

Morven Gentleman : Can you provide an estimate of the integration 
effort that it took to make each of your three examples operational?

Elias Houstis : The WebPDELab is a one work-year effort. The GasTurbnLab
is a five work-year effort. The distributed WebPDELab is less
than a one-half work-year effort. 

Margaret Wright : Could you comment on the issues arising from
the user interface? Some users want to use an advanced graphical user 
interface, but there is also a need for file-based input (for data), and 
perhaps some specialized modeling languages. What approaches have 
you taken? 

Elias Houstis : We have been experimenting with graphical, very
high-level domain-specific languages, and visual programming languages in
the domain of PDEs. We found them to be useful in general. In the case of
GasTurbnLab, we have utilized the commercial IRIS Explorer product to
implement its interface. It supports visual programming and graphics.

Ronald Boisvert : In GasTurbnLab you are gluing together many
modeling components, each of which is very complex. As you scale up 
to a large number of these, how do you verify that you are getting the 
right answers to the problem? 

Elias Houstis : We are using specialized software modules/agents, 
called mediators, among the various components that measure physical 
and mathematical indicators, implement engineering intuition, and make 
decisions about the analysis/design iterative process. 

Ivor Phillips : Since you have opened your systems to the whole world 
there is a concern that someone may submit a problem so large that it 
will consume all of your computing resources for weeks (or more). Do 
you have facilities for checking submitted problems for such issues and 
rejecting them?

Elias Houstis : We monitor WebPDELab user activities and we could
take appropriate action at any time. We have not faced this problem so 
far. Thus, we have not instituted policies and/or facilities to avoid or 
discourage such situations. 

Richard Fateman : Do you check for plausibility of input at the
interfaces between the numerous distributed modules in your applications?

Elias Houstis : We filter the input only for unwanted code. Usually,
users consult us when they have problems.

Abstract This paper discusses several dimensions involved in the design and
implementation of future generations of Problem-Solving Environments
(PSEs). The paper surveys the main requirements posed both by end
users and by system developers. The main issues in the development of
future generations of PSEs are identified. A case study is then discussed
which relates to an ongoing project at the author’s institution. This
research concerns the study of component coordination in a dynamic
PSE and how this issue may influence the design of the architecture of
a generic PSE.

Keywords: Problem-solving environments, coordination. 

1. PROBLEM-SOLVING ENVIRONMENTS 

A Problem-Solving Environment (PSE) aims at helping an end-user 
in the specification and solution of a problem in terms of concepts specific 
to the problem domain. It should allow the development of rapid proto- 
types to ease the experimentation with specific solutions and allow the 
user to learn from experience. Several recent technologies are enabling
the development of more fully integrated environments, ranging from
parallel and distributed computing, component based systems, advanced
interactive visualization, intelligent knowledge processing and discovery,
to large-scale distributed computing. Awareness of these issues has been
emerging in multiple projects all over the world [5, 6].

A PSE is an integrated environment supporting an entire life cycle 
of development and execution steps to solve problems in a specific
application domain. The development steps help the user produce a
specification of the problem to be solved and support its rapid
prototyping so that it may be submitted for execution. This involves tools
ranging from visual specification languages to intelligent components
providing expert assistance to help the user generate and tailor the
PSE to the specific application and user needs.

The execution steps allow the user to interact with an ongoing ex- 
periment, by controlling and monitoring its evolution. They support 
the visualization, processing and interpretation of the input or gener- 
ated data, according to the user’s interests at each point during the 
experiment. This requires the ability to perform activities on a diversity 
of heterogeneous components, such as the problem solvers, their asso- 
ciated expert assistance tools, tools for data processing, interpretation 
and visualization, tools for monitoring and computational steering, and
on-line access to large databases. Examples of such activities include the
selection, evaluation and testing of individual components, their acti- 
vation, interconnection and configuration, the management of working 
sessions, and the monitoring and control of their dynamic evolution. 

Some of those components are specific to the application domain, 
while others are generic tools that can be adapted according to each 
application and experiment. The diversity of the above mentioned com- 
ponents requires adequate infrastructures to support heterogeneous com- 
puting models. In the past decade, many issues for handling heterogene- 
ity including component interconnection have been addressed by multi- 
ple models and associated middleware platforms. This has enabled the 
development of higher levels of functionalities to support the mentioned 
activities. 

In the remainder of this paper, Section 2 identifies requirements for
modern generations of PSEs. Section 3 presents the main dimensions
that should be considered when developing those environments and
briefly surveys approaches to developing more flexible infrastructures for
PSEs. Section 4 describes ongoing work towards building more flexible 
PSEs. 

2. REQUIREMENTS FOR FUTURE 
GENERATIONS OF PSE 

Due to the complexity of the simulation models, the large volume 
of data, and the difficulties of their interpretation, PSEs must satisfy a
series of requirements concerning the end-user and application needs: 

■ Higher Degrees of User Interaction. Increased flexibility in user 
and component interaction demands user interfaces at distinct
abstraction levels. On the one hand, this requires more advanced tools
for computational steering and advanced visualization. On the
other hand, it requires distinct modes of operation in the same
PSE, e.g. allowing off-line or on-line processing or visualization
to be selected depending on the user interest at each point during
an experiment. This also requires the ability to bring new
components into an existing environment in order to provide some
specific functionalities.

■ Intelligence in PSEs. Advisory, explanatory, and expert tools are
important to assist the user during the development and execution 
steps. The search for a balance between automated intelligent tools 
and an adequate level of user interaction will be a major issue in 
future PSEs. 

■ Multidisciplinary Nature of the Applications. This poses the need 
to support interactions between distinct sub-models, based on mul- 
tiple heterogeneous and hybrid components, e.g. for the coupling 
of numerical codes, or the interaction between evolutionary com- 
puting models. On the other hand, PSEs should evolve towards 
distributed collaborative environments, to enable the interactions 
and coordination of activities among multiple users who are
experts in different subproblems.

The PSE architecture must address the following main issues in order 
to enable the above user requirements. 

■ Infrastructures for PSEs. A PSE should be able to work on top 
of low-level and middleware layers which provide the services of a 
meta-level distributed operating system for cluster computing and 
global computing platforms. Besides heterogeneity issues, other 
aspects must be addressed such as operation at a small or large 
scale, security, resource management and system configuration. 

■ Software Architectures. Flexible PSEs require the ability to dy- 
namically adapt the tools and the software architecture of the en- 
tire environment. As the user interests may change during both the 
development and execution steps, the focus is put on the reuse of 
components and their dynamic modification, relying upon object- 
oriented and component-based technologies. On the other hand, 
as applications become more complex, including a large diversity 
of components, one needs models and tools to support the abstract 
specification of PSEs, the reasoning about global system properties 
and the transformations between software levels.

■ Building PSEs. Historically, PSEs have been developed by “manually”
assembling a usually small set of components that are interconnected
in a specific way for a specific application. New methods
for developing and generating PSEs are necessary in order to meet
their intended flexibility and their increased complexity and size.
This requires the identification of more generic architectures and
services for PSEs that can be tailored to specific classes of target
problem domains. It also requires tools for supporting the more or
less automatic generation of specific PSEs.

■ Dynamic Configuration and Coordination Issues. Dynamic PSEs 
will support the modification of components and their interaction 
patterns. This requires both theoretical and practical develop- 
ments on the design of abstract patterns of interactions, on the 
dynamic reconfiguration of software architectures, and on the co- 
ordination of distributed systems. 

These requirements pose new challenges to future generations of PSEs
[10, 12].



3. MAIN DIMENSIONS IN PSE 
DEVELOPMENT 

A PSE involves tools which are specific to the application domain, 
e.g. a simulator, and other more generic ones, such as a monitoring 
tool. There is a need for tools supporting the application building, by 
selecting, evaluating, testing, configuring, activating and interconnect- 
ing, monitoring and controlling the execution of multiple heterogeneous 
components. These main dimensions are represented in Figure 1.

The development of a PSE addresses issues at several distinct ab- 
straction levels. Coordination concerns the consistent representation 
and management of dynamic patterns of interaction among components 
[7], and the definition of the corresponding cooperation and commu- 
nication models. It requires adequate models and frameworks for the 
software architecture of the PSE. Software Architectures concern the 
specification of the structure of a system in terms of its components 
and interconnections [11], and it provides the models and tools to reason
about global system properties. Monitoring and Control includes sup- 
port for the observation and control of distributed experiments, such as 
distributed monitoring, computational steering and advanced visualiza- 
tion[8, 9]. Resource Management and Interconnection Services handle 
configuration of parallel and distributed heterogeneous virtual machines, 
activation of component instances, infrastructures for the interconnection
of heterogeneous components, and management of local and large
scale operations for clusters and metacomputing [4].

Figure 1 Dimensions in PSE development

Among the diversity of ongoing projects[6] we mention some of the 
representative efforts which are opening the way to more advanced PSEs. 
Globus [4] provides an infrastructure for metacomputing, “the GRID”,
giving access to large scale distributed resources and allowing the devel- 
opment of high-level services. Distributed Computational Laboratories]^] 
supports increased interactivity in high-performance computing for sin- 
gle and collaborative users. It provides an infrastructure for distributed 
resource management, and services for the management of experiments 
with computational steering, monitoring and dynamic system behavior. 
A Generic Problem-Solving Environment [12] is building a generic
infrastructure to implement PSEs for distinct application domains. It relies on
an infrastructure for distributed computing and it offers an intermediate 
layer with a set of generic services for the specification of components 
and for abstract resource management. An application dependent layer 
is then used to build specific PSEs for each domain. 

4. AN EXPERIENCE TOWARDS DYNAMIC 
PSE 

A project at the author’s institution aims at building more flexible 
and dynamic parallel and distributed PSEs[3]. One goal is to develop a 
framework for PSEs consisting of heterogeneous components. It should 
allow the design of flexible and extensible tools supporting observation 
and control services, as well as the study of dynamic PSEs, their software 
architecture, and the required coordination models. Another goal is to
use the above framework to implement prototypes for specific application
domains which can be used in real applications and adapted according
to the user needs. This contributes to improving the functionalities and
tools offered by the PSE.

This project initially involved the cooperation with colleagues from 
the Environmental Sciences and Engineering Department [2, 3] for the 
design and implementation of a system architecture for Parallel Genetic 
Algorithms (PGAs). The system also included tools for off-line / on-line
processing and visualization of the evolution of the GA computations, 
as well as tools for on-line modifications of the parameters. Several pro- 
totypes were developed for the execution, visualization and steering of 
PGAs. These versions only supported a static configuration consisting 
of multiple heterogeneous components. They mainly differ in the dis- 
tinct implementations of the GA component (based on shared-memory 
or distributed-memory models), and of the steering components. A 
flexible monitoring and control architecture (DAMS) supporting hetero- 
geneous tools was designed and used to support control and resource 
management services. DAMS is based on a design which only provides 
the minimal functionalities for observation and control of a distributed 
application. The main idea is to allow incremental extension of new 
services, depending on the changing requirements. DAMS is neutral 
concerning the supported services and the target application model. In- 
stead of a fixed API, DAMS allows each service module to provide a 
specific interface, and allows the configuration of the corresponding low 
level drivers which act upon the target application. DAMS was used 
to implement a resource management service for the configuration and 
steering of the mentioned PGA prototype. 

A distributed architecture offers greater potential for integrating
distinct components. Due to the complexity and heterogeneity of mod- 
ern applications, one often needs to subdivide them, each subproblem 
being solved by a distinct model which is allowed to evolve autonomously. 
Still they must be able to interact and cooperate due to global appli- 
cation constraints or to improve global application behavior. In order 
to meet these requirements, the development of a generic heterogeneous 
component-based environment is under way. It will provide increased 
flexibility in the configuration and activation of its components, the pro- 
gramming of their interactions, and the monitoring and control of their 
global and individual behaviors. 

In the simplest case, the components are statically specified and
the configuration of the PSE remains unchanged during an entire experi- 
ment. In order to provide increased flexibility and allow the user to have 
a more interactive role regarding the execution of an experiment, it is
necessary to consider the dynamic insertion and removal of components.
In this way a single user is able to change the system configuration as 
a specific experiment progresses, in order to evaluate distinct aspects 
of the problem. Also, multiple users can concurrently join ongoing ex- 
periments with distinct roles (observers, controllers). Computational 
steering plays an increasingly important role in many complex applications,
helping the user learn how the simulation behaves depending on
a diversity of application and system parameters. It is also important 
to allow the user to focus on specific parts of the problem models or to 
specify the most desirable levels of detail in complex systems. Besides 
user driven steering, agent driven steering can be useful to allow the au- 
tomatic control of the evolution of a computation. Previous knowledge 
about problem behavior can be integrated into intelligent controllers 
that may act autonomously upon the computational components. For 
heterogeneous applications it could be useful to simultaneously support 
user driven and automatic steering components. 

Besides the definition of suitable interfaces for the monitoring and 
steering components, all of the above requires the system to provide 
adequate mechanisms for the consistent coordination of multiple com- 
ponents. We are studying how support for dynamic reconfiguration can 
increase the flexibility of a PSE. In order to evaluate this aspect, we 
are designing a collection of application level scenarios which involve 
multiple tools and components of a PSE. Then we analyze how their 
dynamic reconfiguration can improve the expressiveness of the life cycle 
for application development and execution. The final goal is to be able 
to model a diversity of interaction patterns between components and to 
support their dynamic modification. The model defines a collection of 
coordination operations that allow the control of components and their 
interconnections. 
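
The kind of coordination operations described above can be illustrated with a minimal sketch. It is not the project's actual model: the class name `Configuration`, its methods, and the component names are all hypothetical, chosen only to show how components and their interconnections might be added and removed while an experiment is running.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Hypothetical model of a PSE configuration: components plus directed links."""
    components: set = field(default_factory=set)
    links: set = field(default_factory=set)

    def add_component(self, name):
        self.components.add(name)

    def connect(self, source, target):
        # Only components currently in the configuration can be linked.
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be present before connecting")
        self.links.add((source, target))

    def disconnect(self, source, target):
        self.links.discard((source, target))

    def remove_component(self, name):
        # Drop every link touching the component so no dangling link remains.
        self.links = {(s, t) for (s, t) in self.links if name not in (s, t)}
        self.components.discard(name)


pse = Configuration()
for name in ("solver", "monitor", "visualizer"):
    pse.add_component(name)
pse.connect("solver", "monitor")
pse.connect("solver", "visualizer")
pse.remove_component("visualizer")  # the user drops a tool mid-experiment
```

Keeping the link set consistent inside `remove_component` is one simple way to express the "consistent coordination" requirement: a reconfiguration never leaves an interconnection pointing at a component that no longer exists.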

5. CONCLUSIONS 

A survey was presented of the main dimensions in the design of future 
generations of Problem-Solving Environments. A hierarchy of conceptual
layers makes it possible to identify the main issues. In particular,
coordination issues and the specification of software architectures are
important for handling the increased complexity of dynamic PSEs and
their flexibility. An outline of ongoing work was presented concerning
experimentation towards developing flexible dynamic PSEs and tools.

DEVELOPING AN ARCHITECTURE TO
SUPPORT THE IMPLEMENTATION AND 
DEVELOPMENT OF SCIENTIFIC 
COMPUTING APPLICATIONS

Abstract The increasing complexity and computational demands of scientific
computing applications have led to a need for robust software architectures
to facilitate the conceptualization, design, implementation, deployment 
and maintenance of these applications. This paper attempts to shed 
light on how the unique characteristics of scientific computing applica- 
tions, as well as computational scientists, make impositions upon the 
framework used to support scientific research efforts. We use our ex- 
perience with NetSolve, a toolkit designed just for such interactions, 
as a means to present the approach of one infrastructure to support 
scientific computing. We further discuss how NetSolve implements the 
unique model of using a single system to aggregate, manage and access 
distributed hardware and software resources. 

Keywords: scientific computing, distributed systems, grid computing, heteroge- 
neous network computing, NetSolve 

1. INTRODUCTION 

Today, scientists and engineers have become completely reliant on
the computer as a tool for advanced modeling, simulation and experimental
analyses. Scientific simulations ranging from that of the electromagnetic
field of a defibrillator in a virtual human to warfare scenarios with tens of
thousands of components interacting together are now possible thanks to
significant advancements in computer technology. While the rapid rate
of microprocessor performance growth has been constant, decades ago
scientists realized that an intuitive way to increase the computational
capacity of any single processor/machine is to use software systems that
leverage computer networking infrastructures to connect multiple
processors/machines together. Indeed, the scientific computing application
(SCA) that runs solely on a single processor has become a rare breed.
Supercomputers, with massively parallel processors (MPP) or shared 
memory multi-processors (SMP), are very common and today’s fastest 
computers operate at speeds of over 2 TFlop/s. These speeds have been 
achieved by improved chip technology and computing architectures that 
support parallel processing. As computer scientists and engineers ex- 
plore ways to increase computational capacities and capabilities, com- 
putational scientists now face highly complex architectures with each 
variation itself requiring a variation of the parallel programming model 
needed for code accuracy and performance. Furthermore, since every 
SCA demands interdisciplinary expertise (mathematics, computer science,
and the domain science at the very least), there must be simple
ways for the computational scientist to leverage the effort of algorithm
developers, software system designers and hardware engineers and merge
them with his own expertise.

The emergence of technologies like PVM [1] and standards like MPI [2]
shows strong efforts to develop a common parallel programming environment.
Even more recently, there has been the emergence of Grid computing
concepts [3] that envision software technologies exploiting today’s
high network connectivity to create a single, global virtual machine. 
However, little attention has been focused on the computational scien- 
tists and what is required for them to accomplish useful tasks without 
the colossal nightmare of becoming a computer scientist and a math- 
ematician (and maybe even a magician) overnight. The goal of this 
article is to analyze the unique needs of SCAs and the scientists that 
develop them in an attempt to establish the issues that supporting soft- 
ware infrastructures should resolve. In the following, we also discuss the 
NetSolve distributed computing environment which attempts to address 
these issues in a practical and efficient way. Section 2 defines what we 
have found to be the fundamental characteristics of scientific computing 
that should help to mold the supporting scientific computing application 
infrastructure (SCAI). After a general overview of the NetSolve system 
in Sect. 3, Sect. 4 evaluates some of the features of NetSolve based on 
the gathered requirements of scientists and their applications. What we
hope to present is not only an introduction to NetSolve, but a means 
by which to evaluate any infrastructure that claims to support scientific 
computing based on the needs of that community.

2. CHARACTERISTICS OF SCIENTIFIC
COMPUTING 

Scientific applications typically have four phases as depicted in Fig. 1. 
Regardless of whether the application is Graphical User Interface (GUI) 
based or text based, the first phase entails the gathering and preparation 
of input data. This may mean getting user input, allocating memory re- 
quirements, constructing specialized data structures and the like. After 
the input is made ready, it is passed to the computationally intensive 
phase of the process when complex algorithms are run on the prepared 
data. After the data processing, often there is an analysis phase which 
may be used to determine if further processing of the data is necessary, 
among other things. The data processing and analyses phases may un- 
dergo several iterations until finally some output results are made avail- 
able. This may be in the form of textual output or graphical images 
that visualize the simulation and/or its results. 




Figure 1 The four typical phases of a scientific computing application 

So far, we have made no mention of possible parallelizations that
may take place in the program. Typically, the only difference between
parallel applications and serial ones is that in a parallel application,
the data processing phase is distributed amongst multiple processors.
From this point on, we assume the more common scenario of a parallel
application. At this point, we distinguish two general categories of parallel
applications; this categorization depends on whether or not there is
communication amongst the computational nodes during the computation.
Figure 2 depicts what we refer to as cooperative parallelism where
there is interprocess communication amongst the computational nodes.
Figure 3 shows the other kind of parallelism where there are concurrently
running modules on multiple processors, but no communication
amongst the nodes; we term this independently parallel execution.



Figure 2 The parallel scientific computing application with cooperative parallelism

Figure 3 The parallel scientific computing application with independent parallelism
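
The distinction between the two categories can be made concrete with a small sketch. It uses only the Python standard library, with threads standing in for computational nodes; the functions `simulate` and `cooperating_worker` and the values they exchange are hypothetical and serve only to show where communication does or does not occur during the computation.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def simulate(seed):
    """Hypothetical independent task: needs nothing from the other tasks."""
    return seed * seed


# Independent parallelism: tasks run concurrently and never communicate.
with ThreadPoolExecutor(max_workers=4) as pool:
    independent_results = list(pool.map(simulate, range(8)))


def cooperating_worker(my_value, inbox, outbox, results, index):
    """Hypothetical cooperative task: exchanges data with a neighbour mid-run."""
    outbox.put(my_value)           # send local data to the neighbour
    neighbour_value = inbox.get()  # block until the neighbour's data arrives
    results[index] = my_value + neighbour_value


# Cooperative parallelism: two workers communicate while computing.
a_to_b, b_to_a = queue.Queue(), queue.Queue()
cooperative_results = [None, None]
workers = [
    threading.Thread(target=cooperating_worker,
                     args=(1, b_to_a, a_to_b, cooperative_results, 0)),
    threading.Thread(target=cooperating_worker,
                     args=(2, a_to_b, b_to_a, cooperative_results, 1)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

In the independent case each result depends only on its own input, so tasks can be farmed out to any available server in any order. In the cooperative case neither worker can finish until its neighbour has sent its value, which is why such applications are sensitive to communication cost between nodes.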



The following characteristics of SCAs have been collected from a 
variety of sources, including [4], in an attempt to provide a complete
picture of the demands of scientific computing. The format of the items 
that follow is a description of the characteristic followed by a single 
(emphasized) sentence that summarizes the impact that characteristic 
has on SCAIs: 



•Knowledge Base of the Scientist 

Whether it be a computational chemistry or nuclear engineering ap- 
plication, the computational scientist will have a deep knowledge and 
understanding of the concepts of their scientific domain. They may also 
possess a thorough understanding of relevant mathematical concepts. 
However, their expertise in programming and other aspects of software 
engineering is typically limited. In larger collaborations, the application 
development is often carried out by a concert of domain scientists, com- 
puter scientists and, perhaps, mathematicians. Not every organization 
and research effort can afford such luxuries, nor do they wish to. 

The SCAI must provide an easy and intuitive programming model that
allows the computational scientist to integrate complex supporting software
and highly-optimized mathematical modules without advanced levels of
expertise.

•Composition of Existing Software Components 

To help alleviate the previous problem, highly optimized numerical soft- 
ware libraries are often made available. Numerical solvers (linear/non- 
linear systems solvers, differential equation solvers, etc.) are the fun- 
damental subprograms found in practically all SCAs; packages like LA- 
PACK [5] and PETSc [6] are only two examples of libraries that imple- 
ment some of these sophisticated techniques. As is the case with these 
examples, a wide array of this software is freely available, and even open- 
sourced. For obvious reasons, the domain scientist will want to leverage 
the years, if not decades, of research and development incorporated in 
the older packages or the novel concepts and capabilities of the newer 
ones. 

The SCAI must provide ways to easily incorporate largely varied types 
of computational software modules. 

•Granularity of Computational Modules 

The granularity of the computational modules can be defined as the relation
between the floating point performance of the module and the average
communication bandwidth of a single processor (Flop/Byte) [7]. A
higher granularity means that the computational modules take on large, 
atomic chunks of work at a time, while a lower granularity implies that 
not much processing takes place before data is transported. The granu- 
larity of the computational modules can depend upon the class to which 
that module belongs. Lower level classifications like a mathematical linear
system solver (or even the matrix multiply routine upon which the
solver depends) lead to low granularities. Higher level classifications like
a data transformation module that incorporates linear systems solvers, 
or an image processing module that uses a series of data transforma- 
tions yield higher granularity. Depending on the application, it may be 
warranted to use a large granularity, a small one, or both. 

Software module granularity should bear little or no impact on the 
performance or ease-of-integration of the SCAI.
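
As a rough numerical illustration of the Flop/Byte measure, the sketch below compares two made-up modules. The operation counts (2n^3 floating point operations for a dense n-by-n matrix multiply, 8 bytes per value) are standard estimates, but the workloads themselves, including the choice of 50 chained multiplications, are assumptions for illustration and are not taken from the text.

```python
def granularity(flops, bytes_moved):
    """Floating point operations performed per byte communicated (Flop/Byte)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


n = 1000
matrix_bytes = 8 * n * n  # one n-by-n matrix of 8-byte values

# Low-level module: a single multiply C = A*B moves three matrices (A, B, C).
single_multiply = granularity(2 * n**3, 3 * matrix_bytes)

# Higher-level module (assumed): receives one matrix, applies 50 multiplies
# internally, and returns one matrix, so only two matrices are moved.
chained_transform = granularity(50 * 2 * n**3, 2 * matrix_bytes)
```

Under these assumptions the higher-level module does far more work per byte transported than the single multiply, which matches the qualitative ordering described above.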

•Too Many Choices 

As can be seen, there are many software packages available from a variety 
of sources. While this allows for much flexibility when deciding which 
packages to use, it also implies climbing a steep learning curve to deter- 
mine the peculiarities of each package to make a fair evaluation of which 
suits your purposes. The veritable alphabet soup of packages available
can be daunting, yet discovering which package to use is the least of the
troubles. As is the case with iterative methods, an algorithm is only as
good as the manner in which you use it. Veterans of iterative methods 
agree that finding the right combination of solver, pre-conditioner, scal- 
ing and re-ordering is an art form developed only from experimentation 
within the application area [8]. 

The SCAI must hide the complexities and intricacies of the underlying 
computational modules while providing convenient ways for the scientist 
to discover and use the services he wants. 

•Problem Capacity 

Often, the memory or processing demands of an SCA exceed the capacity
of workstations and require multiple processors to accurately and
efficiently solve the problem specifications. In other cases, it may be
that the only appropriate algorithms discovered by the computational
scientist involve code that only runs on a particular server. Or, combining
these two scenarios, the only feasible solvers are specialized parallel
algorithms on a remote distributed memory MPP system [8]. 

The SCAI must be able to reliably and efficiently execute modules on
remote servers. 

•Interaction Levels 

Although there definitely is a place for GUIs in rapid prototyping and 
quick experimentation with SCAs, generally, these high-performance 
computing applications can take hours, days and weeks to complete a 
single task. The user may not want to keep his GUI running and defi- 
nitely does not want to be required to interact for that duration. And 
while GUIs at times provide a more convenient programming model, the 
day has not yet arrived when a graphical programming environment can 
implement a doubly nested for loop executing a module hundreds of 
times more conveniently than a scripting language. 

Scripting capabilities should be available to allow complex chaining com- 
binations of modules with no sacrifice of convenience and ease-of-use. 
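
The doubly nested loop mentioned above is short when written in a scripting language. The sketch below is illustrative only: `run_module` is a hypothetical stand-in for one invocation of a computational module, and the parameter names and ranges are assumptions.

```python
def run_module(mesh_size, tolerance):
    """Hypothetical stand-in for one invocation of a computational module."""
    return {"mesh_size": mesh_size,
            "tolerance": tolerance,
            "cost": mesh_size ** 2 / tolerance}


mesh_sizes = range(10, 210, 10)                  # 20 mesh sizes
tolerances = [10.0 ** -k for k in range(1, 11)]  # 10 tolerances

# Doubly nested loop: the module is executed once per parameter pair.
results = []
for mesh_size in mesh_sizes:
    for tolerance in tolerances:
        results.append(run_module(mesh_size, tolerance))
```

The script runs unattended, so the user need not keep an interface open or interact for the duration of the sweep.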

•Code Maintenance and Enhancement 

The fact that the FORTRAN programming language is still popular 
after nearly 50 years implies two fundamental things: i) the laws of in- 
ertia apply to computer science as well (i.e. scientists and programmers 
are reluctant to diverge from a platform that can work for their pur- 
poses after tremendous time and efforts have been invested) and ii) it 
is extremely important for codes to be backward compatible and have a 
lifetime as long as possible. As a result, mixed-language programming
is very common in SCAs - hybrid applications now integrate C, FORTRAN
and object-oriented languages like Java and C++ to exploit the
different advantages of each platform, be it performance, maintenance,
security or programming methodologies (or, simply, code availability).

The SCAI must have multiple-language support and provide mechanisms
which allow for easy replacement of components with newer, optimized, 
or simply corrected, versions without significant code modification. 

•Performance 

Last, but by no means least, SCAs demand optimal performance. It 
is the very nature of these high-performance applications to consume 
and dominate any and all resources they can access and still require 
days and weeks of computational time to complete. Data sets can easily 
extend into the hundreds of megabytes and gigabyte range for a single 
application, and this too needs to be considered, especially when data 
transfers to remote servers are involved.

The design of the SCAI must be such that it readily improves the perfor- 
mance of applications with minimal overhead due to component interac- 
tions. 



3. THE NETSOLVE SYSTEM 

We now briefly discuss the architecture and major components of 
the NetSolve system to provide background for the discussion to fol- 
low. NetSolve is being developed at the University of Tennessee’s In- 
novative Computing Laboratory. Its original motivation was to alle- 
viate the difficulties that domain scientists encounter when trying to 
locate/install/use numerical software, especially on multiple platforms. 
Today, the system has evolved into much more than a way to access nu- 
merical solver routines. NetSolve provides an environment that monitors 
and manages computational resources, allocating the services they pro- 
vide to NetSolve-enabled client programs. Built upon standard Internet 
protocols, like TCP/IP sockets, it runs on popular variants of the UNIX
operating system, and parts of the system are available for the Microsoft 
Windows platforms. Further documentation of NetSolve along with 
source code for the full product are available at icl.cs.utk.edu/netsolve. 
NetSolve has been recognized as a significant effort in research and de- 
velopment, and was named in R&D Magazine’s top 100 list for 1999. 

Figure 4 shows the infrastructure of the NetSolve system and its re- 
lation to the applications that use it. NetSolve and systems like it are 
often referred to as Grid middleware; this figure helps to show why. The 
shaded parts of the figure represent the NetSolve system. It can be seen 
that NetSolve acts as a glue layer that brings the application or user to- 
gether with the hardware and/or software it needs to complete useful 
tasks. 





Figure 4 Architectural Overview of the NetSolve System 



At the top tier, the NetSolve client library is linked in with the 
user’s application. Through NetSolve’s application programming inter- 
face (API), client-users gain access to aggregate resources without need- 
ing knowledge of computer networking or distributed computing. In fact, 
the user does not even have to know remote resources are involved. Net- 
Solve currently provides client interfaces for the C, FORTRAN, and Matlab 
programming languages. 

The NetSolve agent represents the gateway to the NetSolve system. 
It maintains a database of NetSolve servers along with their capabilities 
(hardware performance and allocated software) and dynamic usage 
statistics. The agent uses this information to allocate server resources 
for client requests, finding the server that will service the request the 
quickest while keeping a balanced load amongst its servers and keeping 
track of failed ones. Requests are directed away from failed servers. 

The NetSolve server is the computational backbone of the system. It 
is a daemon process that awaits client requests. The server can run on 
single workstations, clusters of workstations, symmetric multi-processors 
or machines with massively parallel processors. A key component of the 
NetSolve server is a source code generator which parses a NetSolve prob- 
lem description file (PDF). This PDF contains information that allows 
the NetSolve system to create new service modules thereby enhancing 
the server’s capabilities. 

4. NETSOLVE AND SCIENTIFIC COMPUTING 

The NetSolve system, as mentioned in the previous section, was 
designed specifically to accommodate the computational scientist. This 
section compares the system with the requirements of the scientific 
computing community as discussed in Sect. 2 by placing NetSolve’s 
features alongside the corollaries of that discussion. 

• Corollary 1: The SCAI should provide an easy programming model 
and API for novice non-computer scientists. 

The NetSolve system allows for client users to embed functions from 
practically any software library into their applications without having 
to install, learn or maintain that package. Fig. 5 shows an example 
Matlab code, before and after the NetSolve API has been integrated. 
The code on the left-hand side is making a call to a Matlab native 
function to multiply two matrices A and B and store the result in C. 
The call to NetSolve, on the right-hand side of the figure, achieves 
the same result via the NetSolve framework. This example shows 
how NetSolve can be used to provide access to complicated software 
modules without requiring expert interactions from the user. Apart from 
module encapsulation, it allows one to create uniform interfaces to 
different packages of similar algorithms. For instance, the intricacies 
of sparse solvers like PETSc [6], AZTEC [9] and others can be hidden by 
a single common interface that takes an additional parameter defining 
the software package to use. 



    Before NetSolve                After NetSolve 

    A = load(input_matrix1);       A = load(input_matrix1); 
    B = load(input_matrix2);       B = load(input_matrix2); 
    C = matmul(A, B);              C = netsolve('matmul', A, B); 

Figure 5 Matlab code: Left side before NetSolve, right side after NetSolve integration 



• Corollary 2: The SCAI must provide ways to easily incorporate 
largely varied types of computational modules. 

NetSolve provides a code generator that parses a NetSolve PDF in 
order to extend the servers’ functional capabilities. Figure 6 shows a 
segment of a PDF that was used to integrate a module from a sub-surface 
fluid simulator. The PROBLEM parameter of this file defines the 
name we want client applications to use when referring to this module. 
The INCLUDE and LIB directives are used in the compilation of the 
module. Among other things, this PDF eventually describes the code 
that determines how to call the simulation code with the inputs given 
from a client program. The PDF facility provides a way for NetSolve to 
seamlessly integrate any type of computational module. Furthermore, 
we have implemented a Java GUI that makes this process even easier. 



@PROBLEM ipars 
@INCLUDE "ipars.h" 
@LIB /home/user/lib/libipars.a 
@DESCRIPTION 
Parallel Sub-Surface Flow Simulator 
@INPUT 2 
@OBJECT STRING CHAR model 
IPARS physical model to use 
@OBJECT FILE CHAR infile 
Input data file 



Figure 6 Sample PDF used to integrate new modules into the NetSolve system. 

• Corollary 3: The system should adapt to varying granularities of 
computational modules. 

While the PDF facility does not concern itself with the granularity of 
the computational module being integrated, an issue in any distributed 
environment is how the amount and size of the data transported affect 
performance. The first way NetSolve attempts to deal with this issue 
is by analyzing network bandwidths and latencies to choose the most 
conveniently located server resources to solve client requests. The 
second way we have optimized data communications is by creating an 
interface and infrastructure that allows a user to group or sequence a 
collection of NetSolve requests [10]. The system then analyzes the input 
and output parameters amongst all requests and caches common data 
near the relevant servers. Figure 7 illustrates the typical transactions 
that take place during a series of NetSolve requests by a single client. 
The important points to note are that parameter A is shared as an 
input for the first and second requests. Also, output parameters C and 
D serve as inputs for subsequent requests. Figure 8 shows the reduction 
in data flow that occurs when the sequencing mode is employed. 




Figure 7 Client-server interactions during a typical request scenario. 

Figure 8 Client-server interactions during a “request sequence”. 
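
The saving from request sequencing can be illustrated by counting object transfers. The following is only a sketch under stated assumptions: the three-request sequence and the names A to F are hypothetical (chosen to match the description above, where A is shared and C and D feed a later request), every object counts as one transfer, and all requests are assumed to run on a single server. It does not reproduce NetSolve's actual analysis.

```python
# Each request is (output name, list of input names). All names are hypothetical.
REQUESTS = [
    ("C", ["A", "B"]),   # first request, uses A
    ("D", ["A", "E"]),   # second request, also uses A
    ("F", ["C", "D"]),   # third request consumes the earlier outputs
]


def transfers_without_sequencing(requests):
    """Every input is sent to the server and every output is sent back."""
    return sum(len(inputs) + 1 for _, inputs in requests)


def transfers_with_sequencing(requests):
    """Each client-side input is sent once; only final outputs are returned."""
    produced = {out for out, _ in requests}
    consumed = {name for _, inputs in requests for name in inputs}
    client_inputs = consumed - produced    # data that originates on the client
    final_outputs = produced - consumed    # results no later request consumes
    return len(client_inputs) + len(final_outputs)
```

For this hypothetical sequence the count drops from nine transfers to four, because A is sent only once and C and D never travel back to the client.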



• Corollary 4: Complexities and intricacies of the underlying compu- 
tational modules must be hidden, yet the modules must be conveniently 
accessible. 

Through NetSolve, users are given access to complex algorithms that 
solve a variety of types of problems, one instance being linear systems 
solvers. Not all solvers, however, are built alike; depending on the char- 
acteristics of the system being solved, some perform poorly and others 
not at all. NetSolve has incorporated a large number of both direct and 
iterative solver algorithms for sparse/dense, symmetric/non-symmetric 
systems. To allow non-expert users to properly and efficiently use 
these algorithms without climbing the steep learning curve that would 
otherwise be involved, we have created an interface that allows them 
to generically call a “LinearSolve” routine which transparently analyzes 
the input matrix and determines which algorithm to use based on input 
characteristics. [11] further discusses this interface and the heuristics 
and decisions that are involved in the algorithm selection process. We 
envision using similar heuristics for classes of problems other than linear 
system solvers in the future. 
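
A selection step of this kind might look like the sketch below. The function names, the density threshold and the returned labels are all illustrative assumptions; the heuristics NetSolve actually uses are the ones discussed in [11], not these.

```python
def is_symmetric(matrix, tol=1e-12):
    """True if the square list-of-lists matrix equals its transpose."""
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


def density(matrix):
    """Fraction of entries that are nonzero."""
    n = len(matrix)
    nonzeros = sum(1 for row in matrix for value in row if value != 0)
    return nonzeros / (n * n)


def choose_solver(matrix, sparse_threshold=0.1):
    """Return a (storage, method) label based on simple input characteristics.

    Illustrative only: a real selector would also need to check properties
    such as positive definiteness before choosing Cholesky or CG.
    """
    sparse = density(matrix) < sparse_threshold
    symmetric = is_symmetric(matrix)
    if sparse:
        return ("sparse", "conjugate gradient" if symmetric else "GMRES")
    return ("dense", "Cholesky" if symmetric else "LU")
```

The caller passes only the matrix; the choice between direct and iterative methods stays hidden behind the single entry point.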

• Corollary 5: The SCAI should reliably execute modules on remote 
servers. 

By its very definition, NetSolve is a distributed computing environ- 
ment that allows for remote problem execution. Its failure detection 
and fault-tolerant mechanisms allow the system to detect servers 
that have failed to solve particular problems or server hosts that are 
non-responsive and direct new requests away from these resources. 
During a computation, the system attempts to use every appropriate 
and capable server host (from best to worst (see Corollary 8)) until 
a problem has been solved or the list of servers has been exhausted. 
Other investigations are leading to the development of heuristics to 
checkpoint NetSolve services so that a mid-service failure does not 
result in a loss of all previous computation. These checkpoints will be 
used to migrate the state of the interrupted service to other NetSolve 
servers where computation will resume. 
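
The retry behaviour described above (try servers from best to worst until one succeeds or the list is exhausted) can be sketched as follows. The exception type, the function names and the representation of a server as a (name, callable) pair are assumptions made for illustration; this is not NetSolve code, and no checkpointing is modelled.

```python
class ServerFailure(Exception):
    """Raised by a (hypothetical) server call that did not complete."""


def solve_with_failover(ranked_servers, problem, failed=None):
    """Try servers from best to worst; return (server name, result)."""
    failed = set() if failed is None else failed
    for name, call in ranked_servers:
        if name in failed:
            continue               # direct requests away from known failures
        try:
            return name, call(problem)
        except ServerFailure:
            failed.add(name)       # remember the failure for later requests
    raise RuntimeError("all candidate servers failed")


def unreachable_server(problem):
    """Stand-in for a host that never answers."""
    raise ServerFailure("no response")


def healthy_server(problem):
    """Stand-in for a working service: doubles its input."""
    return problem * 2
```

With the ranked list `[("s1", unreachable_server), ("s2", healthy_server)]`, the request is restarted on s2 after s1 fails, and s1 is recorded so that later requests skip it.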

• Corollary 6: Scripting capabilities must be available to allow users 
to chain complex combinations of modules. 

All elements of the NetSolve system are accessible via the API of the 
client libraries. Using the functions of this API, users can embed calls 
to NetSolve in compiled languages, like C or FORTRAN, or interpreted 
languages, like Matlab and Mathematica. The nature of these environ- 
ments makes it possible for the user to invoke NetSolve in as simple or 
as complex a way as needed. The asynchronous interface further allows 
the user to make non-blocking requests to NetSolve. The call returns im- 
mediately with a handle to the request that the user can use to probe 
whether the request has completed and to retrieve the results. This interface 
allows users to invoke multiple calls to NetSolve that would then run on 
different hosts, further improving application turnaround time. 
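
The submit, probe and retrieve pattern of the asynchronous interface can be sketched with a local thread pool standing in for remote servers. The names `submit_request`, `probe` and `wait` are hypothetical and are not the NetSolve API; the "remote" service here is a plain-Python matrix product.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)   # stands in for remote servers


def submit_request(function, *args):
    """Non-blocking: returns a handle (a Future) immediately."""
    return _pool.submit(function, *args)


def probe(handle):
    """True once the request has completed; never blocks."""
    return handle.done()


def wait(handle):
    """Block until the request completes and return its result."""
    return handle.result()


def matmul(a, b):
    """Plain-Python matrix product used as the 'remote' service."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Two independent requests are issued before either result is collected.
h1 = submit_request(matmul, [[1, 2], [3, 4]], [[1, 0], [0, 1]])
h2 = submit_request(matmul, [[2, 0], [0, 2]], [[1, 2], [3, 4]])
results = [wait(h1), wait(h2)]
_pool.shutdown(wait=True)
```

Because both requests are submitted before the first `wait`, they can proceed concurrently, which is the source of the turnaround improvement mentioned above.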

To aid in the development of SCAs that do Monte Carlo simulations, 
parameter sweeps and other applications with simple task-parallel 
structures, i.e. independently parallel programming, we have created 
a task farming interface. The task farming interface allows the user 
to make a single call to netsolve, requesting multiple instances of the 
same problem. Rather than single parameters, the user passes arrays 
of parameters as input/output and designates how NetSolve should 
iterate across the arrays for the task farm. The main challenge in this 
effort is scheduling. Indeed, for long running farming applications it is 
to be expected that the availability and workload of resources within 
the server pool will change dynamically. [12] discusses the design of 
this infrastructure and also presents an adaptive scheduling algorithm 
used by the task farming interface to assign tasks to the server resources. 
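
The shape of a farming call (one call, arrays of parameters, one task per array index) can be sketched as below. The `farm` function and the toy `simulate` task are hypothetical; a thread pool replaces the server pool, and the adaptive scheduling of [12] is not modelled.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def farm(function, *parameter_arrays, workers=4):
    """Run function once per index i on (array0[i], array1[i], ...)."""
    lengths = {len(array) for array in parameter_arrays}
    if len(lengths) != 1:
        raise ValueError("all parameter arrays must have the same length")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results[i] belongs to task i
        return list(pool.map(function, *parameter_arrays))


def simulate(seed, steps):
    """Toy Monte Carlo-style task: a seeded random walk of +1/-1 steps."""
    rng = random.Random(seed)
    return sum(rng.choice((-1, 1)) for _ in range(steps))
```

A parameter sweep then becomes a single call such as `farm(simulate, [1, 2, 3], [1000, 1000, 1000])`, with the iteration over the arrays handled inside the farm.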

• Corollary 7: The SCAI must have multiple-language support and 
its components should be completely pluggable. 

As already mentioned, the NetSolve API has been implemented in 
many programming environments. Further motivated by the need to 
support various platforms (among other things), we have implemented 
client proxies to act on behalf of the NetSolve client. The proxy, a 
separate process that resides on the client host, handles (almost) all 
interactions with the underlying meta-computing resources. With a 
standard interface between the client and all proxies, it is possible, 
especially for third party developers, to easily add new language support 
to the NetSolve system. They would simply write libraries that interface 
with the NetSolve proxies from their language of choice, allowing programs 
in that language to become NetSolve-enabled. Figure 9 depicts the 
main idea behind the proxy. The client libraries interact with the 
proxy through a standard API and the proxy interacts with the 
meta-computing system using system-specific mechanisms. This allows 
NetSolve-enabled clients to leverage other meta-computing resources 
apart from those provided by NetSolve. The NetSolve proxy, for 
instance, uses the agent to discover services, contacts the appropriate 
server and establishes a session with that server, which then receives input 
data from the client, executes its service and returns output data. 




Figure 9 Proxy Architecture 
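
The separation between a standard client-facing interface and a system-specific back end can be sketched as follows. This is a simplified illustration: the class and method names are hypothetical, and the proxy is modelled as an in-process object, whereas the NetSolve proxy described above is a separate process on the client host.

```python
from abc import ABC, abstractmethod


class Proxy(ABC):
    """Standard interface seen by every client library (hypothetical)."""

    @abstractmethod
    def call(self, problem, *inputs):
        """Execute the named problem on the inputs and return the output."""


class LocalProxy(Proxy):
    """Stand-in back end: runs services in-process instead of on a grid."""

    def __init__(self, services):
        self._services = dict(services)

    def call(self, problem, *inputs):
        if problem not in self._services:
            raise KeyError(f"unknown problem: {problem}")
        return self._services[problem](*inputs)


def client_call(proxy, problem, *inputs):
    """What a language binding would do: forward the request to the proxy."""
    return proxy.call(problem, *inputs)
```

A binding for a new language only needs to reach `call`; swapping `LocalProxy` for a proxy that talks to a different meta-computing system leaves client code unchanged.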

• Corollary 8: The design of the SCAI must optimize performance 
with minimal overhead to the resources it occupies. 

The NetSolve agent uses both static and dynamic information from 
the servers to assign requests to the best server at that point in time. 
The algorithm for each service is configured with a complexity that de- 
scribes the computational time of the algorithm based on input sizes. 
Performance is measured by the LINPACK benchmark upon server ini- 
tiation, and the server monitors its host, reporting workload information. 
All these parameters, along with network bandwidth and latency infor- 
mation, are used whenever a request is received to rank the appropriate 
servers from best to worst. This list is sent to the client, which then uses 
the fastest server to solve its problem in order to optimize performance. 
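
One plausible way to combine these quantities into a ranking is sketched below. The cost formula (compute time from a complexity function and a load-scaled benchmark rate, plus latency and transfer time) is an assumption made for illustration, as are the dictionary keys; the text above does not state the agent's actual formula.

```python
def predicted_time(server, flops, data_bytes):
    """Estimated seconds to ship the data and run the computation.

    Assumes server["load"] is a fraction strictly below 1.0.
    """
    available = server["mflops"] * 1e6 * (1.0 - server["load"])
    compute = flops / available
    transfer = server["latency_s"] + data_bytes / server["bandwidth_Bps"]
    return compute + transfer


def rank_servers(servers, n, data_bytes, complexity=lambda n: 2.0 * n ** 3):
    """Sort servers from best (smallest predicted time) to worst."""
    flops = complexity(n)
    return sorted(servers, key=lambda s: predicted_time(s, flops, data_bytes))


SERVERS = [
    {"name": "fast-but-far", "mflops": 1000, "load": 0.5,
     "latency_s": 0.1, "bandwidth_Bps": 1e6},
    {"name": "slow-but-near", "mflops": 200, "load": 0.0,
     "latency_s": 0.01, "bandwidth_Bps": 1e7},
]
```

With these made-up numbers, a small problem with significant data favours the nearby server, while a compute-dominated problem favours the faster one, which is why both static and dynamic information enter the ranking.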

The major drawback in distributed systems is often data transfers. 
Though not yet implemented, heuristics are under consideration to decide 
whether performance might be improved by using the client’s host as a 
server, uploading the necessary software once rather than transferring 
data to a server (on possibly several occasions). 

5. CONCLUSION 

Scientific Computing Applications are a prominent part of the field of 
Computer Science. These applications and the computational scientists 
that implement them have unique characteristics that should forge the 
software infrastructures needed to support their development. We have 
discussed the relevant characteristics they possess and presented NetSolve, 
an environment for remotely solving computational problems, as a sys- 
tem that hopes to address these issues. The system is a work-in-progress, 
and at the heart of this effort lies the philosophy that convenient inter- 
faces and ease of administration are most important; every effort is made 
not to sacrifice these elements as the system evolves to meet the needs 
of the scientific computing community. 

DISCUSSION 

Speaker: Jack Dongarra 

Morven Gentleman : The description of NetSolve for naive users 
sounds attractive when it works. What have you done about feedback to 
naive users when something goes wrong? For example, if the computing 
grid is international as described, the error handling system must be 
internationalized in the classical sense so that the error information is 
presented in the local language of the reader. 

Jack Dongarra : At this point NetSolve only returns to the user the 
error condition or flags raised by the library call. 

Brian Smith : What effort have you made to support robustness with 
respect to hardware, network, and software failures to solve the requested 
problem? 

Jack Dongarra : The NetSolve system ensures that a user request 
will be completed unless every single resource capable of servicing the 
request has failed. When a client sends a request to a NetSolve agent, 
it receives a sorted list of computational servers to try. When the client 
has succeeded in contacting one of these servers, the numerical compu- 
tation starts. If the contacted server fails during the computation, then 
another server is contacted and the computation restarts. In the current 
implementation, all input data resides on the user’s machine during the 
computation and can then travel to another server after a failure. Future 
implementations will hand off the input data to NetSolve storage servers 
where it will stay until the computation has finished (after possible fail- 
ures and restarts). The whole restart process is transparent to the user. 
This fault-tolerance mechanism is rather primitive (even though effec- 
tive), and we are investigating techniques to allow task-checkpointing 
among servers. 

David W. Walker : This question relates to data persistence. In a 
sequence of NetSolve calls in which the output from one call is used as 
the input to a subsequent call, can the data be manipulated without 
sending it back to the client after every call? 

Jack Dongarra : We have a mechanism called request sequencing that 
provides for data persistence on the servers. With request sequencing 
a user can group together requests for grid services through NetSolve 
to exploit some common characteristics of these requests and minimize 
network traffic. Request sequencing can be used to implement effec- 
tive scheduling policies and enable more expedient resource allocation 
methods. 

Ivor Philips : Different computers may have put up different ver- 
sions/releases of the same software. How can I, as a user, know that I 
have used the most current version? This seems to point to a serious 
grid management issue. 

Jack Dongarra : With NetSolve there is an audit trail generated for 
a call so a user can see what routine/version/machine was used in the 
invocation. The system does not provide the ability to have the user 
select a specific version. 

Anne Trefethen : How much help does NetSolve offer the user in 
choosing an algorithm? For example, the Matlab interface shown, 
i.e., NetSolve('itmeth', 'PetSc', A, b), doesn’t appear to indi- 
cate which solver and preconditioner should be used. Does NetSolve 
make that choice? 

Jack Dongarra : We are developing a client side system that will 
probe the data and try to figure out what library is the “best” to use. 
This, together with what is available and expected time to solution, will 
determine the actual execution. 

Milind Bhandarkar : What work has been done in the context of 
NetSolve about cost models and licensing models? 

Jack Dongarra : The NetSolve system does not have a cost model to- 
day (although it does have hooks to do accounting). Clearly some cost 
model is needed to account for the cycles and software that are used, but 
today our experiments have focused on building mechanisms to use soft- 
ware and hardware in the grid. Licensing is a bit more tricky since most 
existing agreements have not taken into account the grid environment. 

Abstract We consider two software packages that interact with each other as 
components: Overture and PETSc. An interface between these two 
packages could be of tremendous value to application developers in that 
Overture provides a simple mechanism for generating the large, sparse 
systems of linear equations that correspond to discretizations of a PDE, 
and PETSc provides a powerful collection of methods for solving these 
systems. Two types of interfaces are discussed: the internal interface 
between components, and the external interface for the application de- 
veloper. We compare three basic approaches to developing the internal 
interface between Overture and PETSc, the final one of which is a peer- 
to-peer model. 

Keywords: components, interface, PETSc, Overture, peer-to-peer interaction 

1. INTRODUCTION 

A complete application to numerically solve a partial differential equa- 
tion (PDE) and analyze the results typically involves: a mechanism for 
creating a grid; a scheme for calculating spatial derivative approxima- 
tions; a method for time advancement, which may require use of linear 
algebra routines including scalable linear and nonlinear equation solvers; 
another mechanism for visualization and analysis of the data; and, since 
the application may be performed on a parallel computer, routines for 
communicating data between processors. With our expectations of soft- 
ware rising as the capabilities of computers increase, writing a good 
implementation of any one of these tasks requires significant expertise. 
Indeed, the expectation that one person, or one group, could write an 
entire package for such a general-purpose tool is unreasonable. To create 
such an application, we must take advantage of the expertise of several 
persons or groups, each focusing on one component of the full applica- 
tion. We can then consider a framework in which these components can 
be linked together. (In this paper, we shall use the terms “component” 
and “framework” in a general sense as opposed to the specific defini- 
tions as set forth by the Common Component Architecture Forum [1] 
and other organizations.) 

Before creating such a framework, however, we must learn what makes 
a good component. We can gain this insight by looking at successes 
and failures of various projects that have attempted interaction between 
software packages. In this paper, we shall consider two software packages 
that interact in this manner: Overture [5, 6] and the Portable Extensible 
Toolkit for Scientific Computation, PETSc [2, 3, 4]. 

Overture is a collection of C++ classes that provide tools for solving 
PDEs. It contains a tool for generating composite grids (i.e., lists of 
structured grids that overlap) and a wide variety of operators of varying 
accuracy for computing derivatives via finite difference, finite volume, 
and spectral methods on these grids. Overture is also extensible in that 
application developers can create their own sets of operators. Further- 
more, several equation solver libraries can be used within Overture, and 
new equation solvers can be added. 

PETSc is a scalable library for the solution of PDEs and related prob- 
lems. With PETSc, one can create complete applications, as one might 
within Overture, but the emphasis has been on generating a collection of 
solvers for linear and nonlinear systems of equations as well as lower-level 
infrastructure for managing the details of parallel programming. 

An interface between these two packages could be of tremendous value 
to application developers in that Overture provides a simple mechanism 
for generating large, sparse systems of equations, and PETSc provides a 
powerful collection of methods for solving these systems. From the Over- 
ture developer’s perspective, the obvious mechanism for this interface is 
via a PETSc equation solver class within Overture. The development 
of such a class is still ongoing, but much can be learned about how to 
write useful components by observing this work in progress. 

Two types of interfaces shall be discussed: the internal interface be- 
tween components, and the external interface which the application de- 
veloper will use. Three basic approaches toward developing the internal 
interface between Overture and PETSc have been explored. The first 
approach is to have Overture convert its native data structures into those 
that the Overture developers expect to be appropriate for linear algebra 
purposes, and to require any linear algebra solver to support these for- 
mats. The second is to force the linear algebra solver (PETSc) to use the 
native Overture data structures as vectors and matrices. The third ap- 
proach is to have both Overture and the linear algebra component work 
together to convert Overture’s native data structure into whichever data 
structure the linear algebra component recommends. For the external 
interface, one must balance simplicity with flexibility, to allow the user 
to develop high-performance applications without needing to learn new 
interfaces for well-known tasks. 

2. THE INTERNAL INTERFACE 

When working with multiple software components, the principal bar- 
rier toward interaction is related to the data structures involved. The 
best data structure for one particular task, and component, is not the 
best data structure for another. Clearly, all components cannot be ex- 
pected to use the same data structure. An interface must be generated 
to determine the interaction between the components at the level of the 
data structures. This interface is one that, if properly implemented, is 
used by component developers and is not essential for the application 
developer to use directly. In this section, three approaches toward this 
internal interface are discussed. 

2.1. OVERTURE CONVERTS DATA STRUCTURES TO 
“STANDARD” VARIETIES 

The first approach is for Overture to require linear algebra solvers 
to support specific matrix and vector data structures that are common 
among linear algebra toolkits. In particular, contiguous one-dimensional 
arrays are used to store vectors. For storing matrices, the compressed 
and uncompressed sparse row formats as well as their column-based 
counterparts are currently supported. Overture provides a mechanism 
for the conversion into and out of these specific data structures. These 
data structures are created and destroyed by Overture, but since they 
are supported by other components, those components are free to ma- 
nipulate them as they see fit. 
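
For readers unfamiliar with the compressed sparse row format mentioned above, the following sketch builds the three arrays of the standard CSR layout from a dense matrix and uses them for a matrix-vector product. It is a generic illustration of the format and is not code from Overture or PETSc; the function names are our own.

```python
def dense_to_csr(matrix):
    """Return (values, col_indices, row_ptr) for a list-of-lists matrix."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(j)
        # row i occupies values[row_ptr[i]:row_ptr[i + 1]]
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, x):
    """Matrix-vector product y = A x using only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_indices[k]] for k in range(start, end)))
    return y
```

For the matrix `[[4, 0, 0], [0, 0, 1], [2, 0, 3]]` the conversion yields values `[4, 1, 2, 3]`, column indices `[0, 2, 0, 2]` and row pointers `[0, 1, 2, 4]`.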

From PETSc’s perspective, this approach is acceptable for vectors but 
has only limited benefit for matrices. In particular, PETSc implements 
a (sequential) sparse matrix, which is entirely owned by one process (or 
duplicated across all processes), using the compressed sparse row for- 
mat. But, if this matrix is to be distributed across several processors, 
the data structure must be changed rather dramatically to achieve high 
performance. Similarly, if a certain block structure is present in the 
matrix, PETSc prefers a blocked variant of the compressed sparse row 
format. The use of this blocked variant allows for many fewer cache 
misses and register loads, resulting in vastly superior performance. Nei- 
ther of these two matrix formats is directly supported by Overture, so 
to accommodate these data structures, a second conversion would be 
required. 

From the Overture developer’s perspective, there is only one proce- 
dure for data conversion for vectors. This is quite easy to develop and 
maintain. Matrices are another story, however. Overture currently sup- 
ports four data structures. PETSc supports nine matrix formats cur- 
rently, with four more under development, but only one of these formats 
is supported by Overture. In fact, a draft of the Basic Linear Algebra 
Subprogram (BLAS) standard proposed 13 sparse data structures to be 
supported, including those supported by Overture, and only four which 
are supported by PETSc [7]. One could expect Overture to provide rou- 
tines for conversion into and out of each of these additional varieties, but 
where would the list end? Furthermore, as new software packages are 
developed, new data structures with increasing complexity are bound 
to arise. To deal with this problem, PETSc allows additional formats 
that are user defined; the user can provide new data structures and over- 
load the basic PETSc functions with appropriate implementations. How 
would Overture deal with these additional types? 

2.2. PETSC SUPPORTS NATIVE OVERTURE 
DATA STRUCTURES 

The presence of a user-defined type within PETSc suggests a differ- 
ent approach to the interface. The PETSc matrix and vector operations 
could simply be overloaded to use the data structures for vectors and 
matrices that are defined by Overture. This approach has been used suc- 
cessfully to interface PETSc with the Structured Adaptive Mesh Refine- 
ment Applications Infrastructure (SAMRAI) [9, 10], and other packages. 
It has the advantage that there is no performance overhead associated 
with copying elements between data structures in terms of memory or 
CPU usage. 

Two sources of difficulty are associated with this approach, however. 
First, the linear equation solvers within PETSc are primarily based on 
Krylov subspace methods. In these methods, the most fundamental op- 
eration between matrices and vectors is the matrix-vector product. Un- 
fortunately, this operation is far less efficient when implemented using 
data structures that have not been optimized for linear algebraic oper- 
ations as has been done with the PETSc data structures. The Overture 
data structures, although well optimized for PDE discretization, are not 
optimized for linear algebraic operations. As a result, the cost of copy- 
ing the data into a PETSc data structure once per solution of a linear 
system of equations is far less than the overhead associated with using 
the Overture data structures for the matrix-vector product. This alone 
would be sufficient for not choosing this approach, but there is another 
challenge facing this approach. 

Overture recognizes two basic linear algebra classes, vectors and ma- 
trices. In PETSc however, there is a third basic class, preconditioners. 
In order to achieve high performance when solving the large, sparse lin- 
ear systems of equations generated by the discretization of PDEs with 
Krylov methods, preconditioners are essential. Since Overture has dele- 
gated the creation of preconditioners to the linear algebra component, if 
this approach is to be used, the mechanisms for interoperation of PETSc 
preconditioners with vectors and matrices defined by non-PETSc data 
structures must be understood. 

To achieve high performance when solving systems of linear equations, 
many of the PETSc preconditioners place certain functionality require- 
ments on matrix and vector data types. These attributes include the 
ability to extract the diagonal (block) elements of the matrix and to 
solve systems of equations with the locally owned portion of the matrix. 
Furthermore, it is assumed that one can obtain a pointer to a (locally 
owned) contiguous data array for each vector type. Since the Overture 
data structures are not stored locally as a contiguous array, the vector 
data must be copied into this format for use with many of the PETSc 
preconditioners. As a result, if PETSc users provide their own storage 
formats for matrices and vectors, they often provide their own precon- 
ditioners. As this is not an option for the Overture developer, we return 
to the concept of conversion between data structures. 

2.3. PEER TO PEER INTERACTION 

No developer can learn every possible data structure that could be 
used for matrix and vector storage and then provide all possible con- 
version routines. As a result, a two-step conversion process was used 
in early releases of Overture. However, since the Overture developer 
knows its data structure, and the linear algebra component has intimate 
familiarity with its own formats, the two packages should be able to co- 
operate and together carry out the conversion process in a more direct 
manner, despite the complexity of the exact process involved. To do so, 
we employ a nontraditional approach. 

The tradition in scientific computing software has been to gather 
groups of developers together and have them discuss data structures and 
interfaces. The intended result is a standard that other software pack- 
ages can use; eventually, high-performance implementations based upon 
these standards could be developed and used by all. This is the model 
that was used when generating the BLAS and LAPACK standards. In 
the realm of dense linear algebra, the resulting implementations have 
enjoyed a great deal of success. But this is not the case for large, sparse 
systems generated from the discretization of PDEs as the sparse matrix 
standard has yet to be finalized. Furthermore, the current draft of the 
BLAS Standard for sparse matrices does not specify the underlying im- 
plementation of the sparse matrix, leaving that decision to the author 
of the particular BLAS implementation [8]. 

Instead of relying on the existence of a standard data structure, a 
generic converter can be created. To do so, one must remove the re- 
sponsibility for generating the matrix and vector data structure from 
Overture, and share it with the linear algebra component. However, 
this linear algebra component cannot be expected to know how to prop- 
erly traverse a data structure that it does not know. A compromise 
must be found; each component can be expected to perform only the 
tasks it knows how to perform. This means that during the conversion 
between the two data structures, Overture would provide two services: 
size information and traversal path; and the linear algebra component 
would provide two additional services: new data structure allocation and 
element definition. 

In particular, Overture would provide generic linear algebra information 
such as the global and local matrix dimensions and a bound on 
the number of nonzero elements in each row of the matrix. This infor- 
mation would then get passed to the equation solver component via a 
routine called AllocateMatrix. PETSc and other components would then 
provide a specific implementation of AllocateMatrix that uses this in- 
formation to generate an empty matrix data structure. Overture would 
also provide a mechanism for walking through its data structures while 
making calls to another routine, called SetMatrixElement, and each com- 
ponent would implement this routine in the most suitable manner. The 
implementation is quite simple in C++. Each equation solver subclass 
is derived from a base class that has the two required functions declared 
as virtual. 
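
As a minimal sketch of this arrangement, the base class and one derived 
solver might look as follows. The names EquationSolver, 
DenseEquationSolver, and ConvertStencil are hypothetical and are not 
taken from the Overture or PETSc sources; the dense storage and the 
three-point stencil are assumptions chosen only to keep the example 
short. 

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical base class: the two conversion services a solver supplies.
class EquationSolver {
public:
    virtual ~EquationSolver() = default;
    // Create an empty matrix from generic size information.
    virtual void AllocateMatrix(std::size_t globalRows, std::size_t localRows,
                                std::size_t maxNonzerosPerRow) = 0;
    // Store one coefficient in whatever format the solver prefers.
    virtual void SetMatrixElement(std::size_t row, std::size_t col,
                                  double value) = 0;
};

// Toy solver component (an assumption) that keeps a dense row-major matrix.
class DenseEquationSolver : public EquationSolver {
public:
    void AllocateMatrix(std::size_t globalRows, std::size_t /*localRows*/,
                        std::size_t /*maxNonzerosPerRow*/) override {
        n_ = globalRows;
        values_.assign(n_ * n_, 0.0);
    }
    void SetMatrixElement(std::size_t row, std::size_t col,
                          double value) override {
        assert(row < n_ && col < n_);
        values_[row * n_ + col] = value;
    }
    double at(std::size_t row, std::size_t col) const {
        return values_[row * n_ + col];
    }
private:
    std::size_t n_ = 0;
    std::vector<double> values_;
};

// Stand-in for the application side: it knows the sizes and the traversal
// order (here a 1-D three-point stencil) but nothing about the solver format.
void ConvertStencil(std::size_t n, EquationSolver& solver) {
    solver.AllocateMatrix(n, n, 3);
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0) solver.SetMatrixElement(i, i - 1, -1.0);
        solver.SetMatrixElement(i, i, 2.0);
        if (i + 1 < n) solver.SetMatrixElement(i, i + 1, -1.0);
    }
}
```

The traversal routine sees only the base class, so a different solver 
component can be substituted without changing the application side. 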

This approach raises several issues related to the efficiency of the con- 
version process. The SetMatrixElement method would need to be able 
to insert an element into any location in the matrix. This might require 
a significant amount of alteration to the data structure or might involve 
communication between processors. Clearly, some sort of aggregation 
process should be allowed, in which case a SetMatrixRow could help. 
But this would not be of use if the matrix were stored in a column-based 
format. A generic SetMatrixElements would clearly be preferred. 
To allow for greater aggregation, the component developer might also 
employ a stash based approach toward the setting of element values. In 
this approach, the elements initially get set into a private stash (perhaps 
a linked list), which would then get manipulated and converted into the 
final format. As a result, another virtual operation, AssemblyFinalize, 
should be added to the base equation solver class to facilitate this second 
step. 
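
The stash-based approach can be sketched as follows. This is only an 
illustration under stated assumptions: the class name StashedCsrMatrix, 
the compressed-row target format, and the rule that a later write to the 
same location replaces an earlier one are all hypothetical choices, not 
the behavior of any particular library. 

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class StashedCsrMatrix {
public:
    explicit StashedCsrMatrix(std::size_t rows)
        : rows_(rows), rowStart_(rows + 1, 0) {}

    // Elements land in a private stash; the compressed arrays are now stale.
    void SetMatrixElement(std::size_t row, std::size_t col, double value) {
        assert(row < rows_);
        stash_.push_back(Entry{row, col, value});
        valid_ = false;
    }

    // Sort the stash by (row, column) and rebuild the compressed-row arrays.
    // The stash is kept, so later insertions are merged by calling this again.
    void AssemblyFinalize() {
        std::stable_sort(stash_.begin(), stash_.end(),
                         [](const Entry& a, const Entry& b) {
                             return a.row != b.row ? a.row < b.row
                                                   : a.col < b.col;
                         });
        cols_.clear();
        vals_.clear();
        std::fill(rowStart_.begin(), rowStart_.end(), 0);
        for (std::size_t k = 0; k < stash_.size(); ++k) {
            const Entry& e = stash_[k];
            const bool overwritten = k + 1 < stash_.size() &&
                                     stash_[k + 1].row == e.row &&
                                     stash_[k + 1].col == e.col;
            if (overwritten) continue;  // a later write to this slot wins
            cols_.push_back(e.col);
            vals_.push_back(e.value);
            ++rowStart_[e.row + 1];     // count entries per row
        }
        for (std::size_t r = 0; r < rows_; ++r)
            rowStart_[r + 1] += rowStart_[r];  // counts -> row offsets
        valid_ = true;
    }

    bool IsValid() const { return valid_; }

    // An operation that requires a valid (assembled) matrix.
    std::vector<double> Multiply(const std::vector<double>& x) const {
        assert(valid_);
        std::vector<double> y(rows_, 0.0);
        for (std::size_t r = 0; r < rows_; ++r)
            for (std::size_t k = rowStart_[r]; k < rowStart_[r + 1]; ++k)
                y[r] += vals_[k] * x[cols_[k]];
        return y;
    }

private:
    struct Entry { std::size_t row, col; double value; };
    std::size_t rows_;
    std::vector<std::size_t> rowStart_;
    std::vector<std::size_t> cols_;
    std::vector<double> vals_;
    std::vector<Entry> stash_;
    bool valid_ = true;
};
```

Elements may be set in any order; the cost of reorganizing the data 
structure is paid once, in AssemblyFinalize, rather than on every 
insertion. 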

This has introduced a second non-traditional characteristic of this 
conversion process. In many linear algebra libraries, there is no concept 
of an invalid matrix. That is to say, once a matrix is created, it can 
be used. The stash based approach to setting elements in a matrix re- 
quires that this not be the case. By setting a single element, the matrix 
data structure becomes invalid, since information about the matrix it- 
self is located in the temporary stash. The matrix is then made valid 
by performing an AssemblyFinalize step after additions to the stash are 
complete. It would be possible to hide the explicit call to AssemblyFi- 
nalize from the end user by placing this call within each operation that 
requires a valid matrix data structure, but there is a penalty for such an 
approach. 

In PETSc, when adding elements to a matrix, a stash is used [2]. For 
parallel matrix formats this provides one particularly important benefit: 
elements can be added in one process that are to be stored as part of 
the local matrix in a different process. To allow the application devel- 
oper to overlap the required communication with computation, PETSc 
divides the process into two stages: MatAssemblyBegin, which initiates 
the communication, and MatAssemblyEnd which terminates communi- 
cation and performs the final data structure assembly. If there were no 
concept of an invalid matrix in PETSc, or if this concept were hidden 
from the application developer, the idle time that occurs during these 
communication stages could not be used for useful computation. 

This mechanism is quite useful for achieving high performance and 
could be incorporated not only by dividing AssemblyFinalize into 
two pieces, but also by dividing the entire conversion process into two parts. 
MatrixConversionBegin would contain allocation of memory for the ma- 
trix data structure in the equation solver, and traversal of the Overture 
data structure while setting each element (or row/column) of the matrix, 
making the matrix invalid. MatrixConversionEnd would assemble 
the valid matrix and perhaps provide some useful debugging options to 
verify that the valid matrix was assembled properly. 

3. THE EXTERNAL INTERFACE 

Finally, we must determine how to appropriately encapsulate the in- 
terface information to generate a well-defined minimum interface layer 
as well as additional convenience layers. The question to be addressed 
is one of exposure: How much is enough? Should an application devel- 
oper be expected to know the entire application-programming interface 
(API) for each individual component involved? That is to say, should an 
Overture user be expected to know all the details of how to use PETSc? 
Clearly, the answer is no; but the same question should be asked about 
a PETSc user. Furthermore, knowledgeable PETSc users should not 
feel as though they are restricted when using Overture to generate their 
systems of equations, nor should they be required to learn a completely 
new API for the PETSc aspects of their application. The minimum 
layer of interface must be found and adequately described, while other 
convenience layers may be provided so that multiple APIs do not need 
to be learned. 

In this area, a “least common denominator” interface makes some 
sense. An Overture user with no experience with PETSc, or other equation 
solvers, would have access to a very basic set of commands: BuildMatrix, 
BuildRHSAndSolutionVector, and Solve. A common interface 
for selecting various linear equation solvers such as GMRES and CG 
and preconditioners such as ILU and Jacobi would be provided, as well as 
mechanisms for selecting the different data structure conversion options. 
But, this alone would be a gross limitation for expert users of PETSc 
who wish to use their own advanced equation solvers and preconditioners 
built within PETSc. 

Similarly, the Overture user is primarily interested in finding a numer- 
ical solution to a PDE subject to a certain discretization. Many methods 
for solving time dependent PDEs do not require solving a system of lin- 
ear equations but do require solution of a nonlinear system of equations. 
Many linear algebra components do not provide a mechanism for this, 
but PETSc does. It would be a great disservice to application developers 
if this additional capability within PETSc could not be exploited. 

To achieve this end, the PETSc matrix and vector data structures are 
public members of the class PETScEquationSolver. In this manner, once 
the BuildMatrix and BuildRHSAndSolutionVector routines have been 
called, these data structures can be extracted, and the full PETSc API 
can be used without limitation, if desired. Furthermore, an Overture-enhanced 
version of the PETSc API for solving nonlinear systems has 
also been generated to simplify this process, which may be incorporated 
into a PETScNonlinearEquationSolver class in future versions of the 
PETSc-Overture interface. 

4. LESSONS LEARNED 

From this work, we have identified various guidelines that a component 
developer should follow. For a component to be used, it must 
interact with other components. Common approaches to enable this in- 
teraction have been to mandate adherence to a standard data structure 
or to provide a single data structure for all to use. For the solution of 
large, sparse systems of linear equations, these approaches do not allow 
the user sufficient flexibility to obtain high performance. This limita- 
tion in turn discourages the use of that component. One must accept 
the fact that data structures as well as the algorithms that use those 
data structures determine performance. To allow for the use of a wide 
variety of data structures, a data conversion process is required. For 
conversion processes to be successful, both sides in the conversion pro- 
cess need to cooperate. An appropriate division of tasks and assignment 
of responsibility is essential. In general, this division of tasks must al- 
low each component to provide the services and information that it can, 
and must not require knowledge beyond the scope of the component. 
By specifying methods to be used and not implementations, we allow 
component writers the freedom to implement the highest performance 
data structures for their specific task, without placing limitations on the 
interaction with other components. 

When making a component, two aspects to the interface require at- 
tention: the external interface detailing what it can accomplish, and the 
internal interface to other components. It is easy to neglect the second 
while focusing on the first. But, doing so can make the component much 
less powerful. 

Ultimately the success of the component lies in the answer to one 
question: Do people use it with other components? 
DISCUSSION 

Speaker: Kristopher Buschelman 

Margaret Wright: One of your conclusions is that component writers 
should specify methods and not implementations. How precisely do you 
mean to distinguish these? 

Kristopher Buschelman: In simple terms, implementations are the 
data structures and the operations that are specific to those data struc- 
tures, whereas methods describe the functionality which is more ab- 
stract, and not data structure specific. Each method will have a vari- 
ety of implementations. Some examples of methods for sparse matrices 
would be matrix-vector multiplication, ILU factorization, etc. An 
implementation-specific interface layer could include a routine which returns 
an array containing all of the nonzero values of the matrix. 

Margaret Wright: If the method is specified at too high a level of 
abstraction, it becomes, at some level, meaningless. For example, certain 
sparse matrix methods implicitly suggest certain data structures; hence, 
specifying them inherently leads to specifying an implementation. 

Kristopher Buschelman: Certain interfaces to sparse matrix implementations 
suggest certain data structures. These interfaces are nice in 
that if you are using those data structures particular optimizations can 
be performed. However, these prove limiting if the data structures used 
by the different components are not compatible. A more general inter- 
face needs to be provided to allow for these cases. These data structure 
interfaces exist at too low a level for this type of component interaction. 

For example, SAMRAI has a hierarchical data structure to store vec- 
tor coefficients. PETSc’s VecGetArray routine, which is required to 
return a contiguous array of vector coefficients, proved to be such a 
challenge in the interface between PETSc and SAMRAI that this routine is 
simply not implemented for the SAMRAI type of PETSc vector. For- 
tunately, this routine is primarily required for matrix-vector products 
using certain of PETSc’s matrix implementations, which are never used 
within the SAMRAI framework. PETSc allows SAMRAI to use a matrix 
type which specifies its own implementation of this product. 
Abstract: We are developing scientific software component technology to manage 
the complexity of modern, parallel simulation software and increase the 
interoperability and re-use of scientific software packages. In this paper, 
we describe a language interoperability tool named Babel that enables 
the creation and distribution of language-independent software libraries 
using interface definition language (IDL) techniques. We have created 
a scientific IDL that focuses on the unique interface description needs 
of scientific software, such as complex numbers, dense multidimensional 
arrays, and parallel distributed objects. Preliminary results indicate 
that in addition to language interoperability, this approach provides 
useful tools for the design of modern object-oriented scientific software 
libraries. We also describe a web-based component repository called 
Alexandria that facilitates the distribution, documentation, and re- 
use of scientific components and libraries. 

Keywords: component technology, language interoperability, software repository, 
parallel high-performance scientific software 

1. MOTIVATION 

Numerical simulations play a vital role as a basic research tool for 
understanding fundamental physical processes. As simulations become 
increasingly sophisticated and complex, no single person — or even single 
institution — can develop scientific software in isolation. Development 
teams rarely possess sufficient resources and scientific expertise in all re- 
quired domains to successfully create a complex application from scratch. 
Instead, physicists, chemists, mathematicians, and computer scientists 
concentrate on developing software in their domain of expertise. Com- 
putational scientists create simulations by combining these individual 
software pieces. 

In collaboration with the Common Component Architecture forum [1], 
we are developing software component technology for high-performance 
parallel scientific computing. The goal of this effort is to improve the 
software development processes of scientific codes by using proven tech- 
niques and technology from industry. Component technology addresses 
technological barriers to software re-use and integration, such as incom- 
patibilities in programming languages, interface descriptions, and phys- 
ical deployment. By removing such barriers, component approaches will 
allow computational scientists to concentrate on building more sophisti- 
cated numerical simulations and reduce effort wasted integrating incom- 
patible software. 

In this paper, we describe our recent work in two areas of component 
technology: language interoperability and a component repository. As 
part of our language interoperability efforts, we are developing a tool 
called Babel to enable the creation and distribution of language inde- 
pendent software libraries. To use Babel, library developers describe 
their software interfaces in a Scientific Interface Definition Language 
(SIDL). Babel uses this SIDL interface description to automatically 
generate “glue code” that enables the software library to be called from 
any supported language. We have also designed and implemented a 
prototype web-based repository called Alexandria to encourage the 
distribution and reuse of scientific computing software components and 
libraries. Alexandria provides a convenient web-based delivery system 
and thus lowers the barrier to adopting component technology. 

This paper is organized as follows. Section 2 surveys component tech- 
nology approaches for scientific computing and discusses related work. 
Section 3 discusses our language interoperability approach, modifications 
necessary for the scientific domain, the Babel tool, and experiences us- 
ing Babel in a high-performance scientific software library. Section 4 
introduces the Alexandria web-based component repository and its 
implementation architecture. Finally, Section 5 summarizes the con- 
tributions of this work and discusses future research directions for the 
scientific component community. 




Component Technology for Scientific Software 



2. SCIENTIFIC COMPONENT TECHNOLOGY 

Component technology [25] is an extension of object-oriented software 
technology that focuses on the issues of software interoperability and re- 
use. Component technology provides language independence, compiler 
independence, and seamless access to distributed object resources. Com- 
ponent technology is more than object-oriented approaches, software 
modules, scripting [3, 4], or software frameworks [7, 8, 10, 14]; however, 
component approaches do make use of these other related technologies. 
A software framework may be created within a component architecture 
to address a particular application domain. Scripting languages may be 
used as an integration language to connect existing software components. 

Industry has created component technology to address issues of in- 
teroperability due to different programming languages, the complexity 
of applications developed using third-party software, and the incremen- 
tal evolution of large legacy software. There are three common com- 
ponent technology standards in the business community: COM [12], 
JavaBeans [24], and CORBA [19]. COM is Microsoft’s component stan- 
dard that forms the basis for interoperability among all Windows-based 
applications. Microsoft recently introduced a new component initiative 
called .NET [18] that combines ideas from COM and Java and will likely 
be the future of Microsoft technology. Sun Microsystems has developed 
JavaBeans and Enterprise JavaBeans [23] based on the Java program- 
ming language. CORBA, by the Object Management Group (OMG), is 
a cross-platform distributed object specification that supports the inter- 
action of complex objects written in different programming languages 
distributed across a network of computers. 

Component technologies such as CORBA, COM, and JavaBeans have 
been very successful in industry; unfortunately, they are designed for the 
business environment and do not address many of the issues associated 
with large-scale parallel scientific computing. For example, industry ap- 
proaches do not address data distribution support for massively parallel 
SPMD components. 

We believe that a successful component technology for scientific sim- 
ulation must address four issues: language interoperability, common 
component behavior, physical deployment standards, and support for 
distributed parallel communication. The work presented in this paper 
addresses only a small part of the overall component technology solution. 
Community collaborative work such as that by the Common Component 
Architecture (CCA) [1] forum and others is essential. In the following, 
we review related component technology work in the scientific community. 

Both CORBA [19] and COM [12] address language interoperability 
through the use of an Interface Definition Language (IDL). An IDL 
describes the interface of a software component using a new descriptive 
language that is independent of any particular programming language. 
We follow a similar approach in our language interoperability 
work, which is presented in Section 3. IDL technology has the advan- 
tage that, in some sense, all languages are equal, and any language may 
call any other language. The primary disadvantage of an IDL approach 
is that the developer must write a separate interface description of the 
software library and then must follow certain programming conventions 
that map the interface description into the programming language. Au- 
tomatic wrapping approaches such as SWIG [3] or SILOON [17] support 
language interoperability without requiring a separate IDL description 
but are typically limited to the case of a scripting language (such as 
Python) calling a compiled language (such as C or C++). In contrast, 
IDL approaches allow method invocations in both directions. 

Beyond language interoperability, component architectures typically 
require that all components support some common set of behaviors. 
Common behaviors are important for the discovery of component ca- 
pabilities (e.g., “What interfaces do you export?”) required by GUI 
development tools and problem solving environments [6, 13, 20]. For ex- 
ample, the CCA specification requires that all CCA components support 
the notion of a port [1]. Ports describe the interfaces used by and pro- 
vided by a component. Our IDL technology plays a role as a mechanism 
for describing component port interfaces. 

Component problem solving environments (PSEs) may also require 
standards for describing the physical deployment of component soft- 
ware. For example, CCAT [6] employs an XML [28] component deploy- 
ment descriptor that enables the PSE to understand component ports, 
port interface types, platform dependencies, and associated component 
metadata. One of the goals of the Alexandria component repository 
described in Section 4 is to provide a common repository for component 
descriptions for use by tools such as a PSE. 

Unlike industry approaches, scientific component technology must 
support communicating parallel components. In most high-performance 
applications, components will communicate within the same memory 
address space, although the components themselves may be distributed 
across processor memories in a SPMD fashion. Some applications, however, 
will span multiple parallel computers. For example, a large simulation 
running on thousands of processors may be connected to a visualization 
component running on a small visualization engine with a few 
tens of processors. In this case, the component architecture must sup- 
port some form of parallel data redistribution. A number of researchers 
have addressed this issue for certain limited classes of data types. Both 
PAWS [5] and CUMULVS [16] support parallel redistribution of arrays 
and other predefined data items such as particles or simple unstructured 
meshes. PARDIS [15] and Cobra [22] support distributed sequences and 
arrays in CORBA. We and other members of the CCA working group 
are researching approaches for extending this work to more general sci- 
entific objects, but that work is preliminary and beyond the scope of 
this paper. 

3. LANGUAGE INTEROPERABILITY TECHNOLOGY 

Computational scientists developing large simulation codes often face 
difficulties due to language incompatibilities among various software 
libraries. Scientific software libraries are written in a variety of programming 
languages, including Fortran, C, C++, or a scripting language such 
as Python. Language differences often force software developers to gen- 
erate mediating “glue” code by hand. In the worst case, computational 
scientists may need to re-write a particular library from scratch or not 
use it at all. 

We have developed a tool called Babel that addresses language inter- 
operability and re-use for high-performance parallel scientific software. 
Its purpose is to enable the creation, description, and distribution of 
language independent software libraries. In the following sections, we 
describe our interoperability approach, the Babel tool architecture, and 
an example of using Babel in a parallel linear algebra software library. 

3.1. SCIENTIFIC IDL 

Babel addresses the language interoperability problem using Interface 
Definition Language (IDL) techniques [12, 19]. An IDL describes the 
calling interface (but not the implementation) of a particular software 
library. IDL tools use this interface description to generate “glue code” 
that allows a software library implemented in one supported language to 
be called from any other supported language. We have designed a Sci- 
entific Interface Definition Language (SIDL) that addresses the unique 
needs of parallel scientific computing. SIDL supports complex numbers 
and dynamic multi-dimensional arrays as well as parallel communication 
directives that are required for parallel distributed components. SIDL 
also provides other common features that are generally useful for software 
engineering, such as enumerated types, symbol versioning, name 
space management, and an object-oriented inheritance model similar to 
Java. 

As illustrated in Figure 1, SIDL bears a close resemblance to CORBA 
and Java. The package keyword introduces a new namespace. A names- 
pace may contain a class, interface, enumerated type, or another package. 
Classes and interfaces contain methods. The methods in an interface 
are abstract; that is, they are not implemented by the interface. As 
in CORBA, in, out, and inout modify method arguments and denote 
the direction of information transfer. SIDL also supports Javadoc-style 
documentation comments, which may be used to automatically generate 
browsable documentation (see the Alexandria discussion in Section 4). 

The following sections provide additional details concerning some of 
the more distinctive characteristics of SIDL. 

3.1.1 Symbol Versioning. In SIDL, every package, enumer- 
ated type, class, and interface is assigned a particular version number. 
Every SIDL description begins with one or more version statements. 
Each version statement contains a package name and an arbitrary ver- 
sion string consisting of a sequence of integers separated by periods. All 
symbols within a package share its version number. For example, the 
version statement on the first line of Figure 1 states that all symbols 
defined in the hypre package will be version 1.0 of that symbol. A 
version statement is required for every new outermost package defined 
in a SIDL description. A version statement may also be used to give 
an explicit version number for resolving external symbols referenced in 
a SIDL description. If a version is not specified for a particular external 
symbol, then the most recent version of that symbol is used. 

Symbol versioning is an important consideration for the development 
of community-wide standards and specifications. Consider a standards 
committee that releases version 1.0 of a particular specification. Com- 
ponents will be written to and implement that version of the standard. 
When the committee releases version 2.0 of the specification, some com- 
ponents will immediately implement the new standard, whereas others 
will take longer. Versioning removes ambiguity about which version of 
the specification a particular component implements. 
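
To make the versioning rule concrete, the following sketch parses 
version strings of the form described above and selects the most recent 
one. It is not Babel code: the function names are hypothetical, and the 
ordering (component-wise integer comparison, with a shorter version 
sorting before any longer version it is a prefix of) is an assumption, 
since the text does not define how two version strings are compared. 

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a version string such as "1.10.2" into the integers {1, 10, 2}.
std::vector<int> ParseVersion(const std::string& text) {
    std::vector<int> parts;
    std::istringstream in(text);
    std::string field;
    while (std::getline(in, field, '.')) {
        assert(!field.empty());
        parts.push_back(std::stoi(field));
    }
    return parts;
}

// True when version a precedes version b.
// Assumed ordering: compare integer by integer; "1.0" precedes "1.0.1".
bool VersionLess(const std::string& a, const std::string& b) {
    const std::vector<int> pa = ParseVersion(a);
    const std::vector<int> pb = ParseVersion(b);
    return std::lexicographical_compare(pa.begin(), pa.end(),
                                        pb.begin(), pb.end());
}

// Pick the most recent version, as is done when none is requested.
std::string MostRecent(const std::vector<std::string>& available) {
    assert(!available.empty());
    return *std::max_element(available.begin(), available.end(), VersionLess);
}
```

Comparing the fields as integers rather than as text matters: under this 
ordering version 1.10 follows version 1.9, which a plain string 
comparison would get wrong. 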

3.1.2 Import. Like Java, SIDL supports a type of import 
statement. The import statement adds the specified package name to 
the symbol resolution path. For example, a SIDL description that refer- 
ences symbol Vector in package hypre could either use the fully qualified 
name hypre.Vector or begin with “import hypre” and then simply use 
version hypre 1.0;

/**
 * A SIDL type description for the <em>hypre</em> library.
 */
package hypre {

  /**
   * <code>Vector</code> represents a mathematical vector.
   */
  interface Vector {
    Vector clone();
    void scale(in double a);
    double dot(in Vector x);
    void axpy(in double a, in Vector x);
    int getGlobalDimension();
    int getLocalDimension() local;
  }

  /**
   * An <code>Operator</code> maps one vector into another vector.
   */
  interface Operator {
    void apply(in Vector x, out Vector y);
  }

  /**
   * This interface represents the class of linear mappings.
   */
  interface LinearOperator extends Operator {
  }

  /**
   * <code>StructVector</code> is a vector for structured grids.
   */
  class StructVector implements-all Vector {
    array<int> getGhostCellWidth();
  }

  /**
   * The structured matrix class implements all operator functions.
   */
  class StructMatrix implements-all Operator {
    // methods used to build a structured matrix omitted
  }

}

Figure 1 A simplified SIDL interface description for portions of the hypre software 
library described in Section 3.3. 
the name Vector (assuming, of course, that another Vector did not already 
exist in that name scope). External symbol references are resolved 
by searching an associated symbol repository, either a file repository or 
a web-enabled repository such as Alexandria. 
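
The effect of an import on symbol resolution can be sketched as below. 
This is a simplified illustration rather than the Babel algorithm: the 
function ResolveSymbol is hypothetical, the repository is modeled as a 
plain set of fully qualified names, and the choice to search imported 
packages in order and return the first match is an assumption. 

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Return the fully qualified name for `name`, or an empty string if the
// symbol cannot be found. `importPath` lists imported packages in order.
std::string ResolveSymbol(const std::string& name,
                          const std::vector<std::string>& importPath,
                          const std::set<std::string>& repository) {
    assert(!name.empty());
    // A name that is already fully qualified resolves to itself.
    if (repository.count(name) != 0) return name;
    // Otherwise try each imported package in turn (assumed search order).
    for (const std::string& package : importPath) {
        const std::string candidate = package + "." + name;
        if (repository.count(candidate) != 0) return candidate;
    }
    return std::string();
}
```

With hypre on the import path, the short name Vector resolves to 
hypre.Vector, while a name found in no imported package is reported as 
unresolved. 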

3.1.3 Inheritance Model. The SIDL inheritance model is 
similar to that of Java. SIDL supports both interfaces and classes. The 
methods in an interface are abstract and thus not implemented by that 
interface. The methods in a class may be either abstract or implemented 
by that class. SIDL supports multiple inheritance of interfaces but single 
implementation inheritance of classes. An interface may extend other 
interfaces. A class may implement many interfaces but extend only one 
other class. This inheritance model simplifies the Babel implementa- 
tion and removes the diamond implementation inheritance ambiguity 
associated with C++. Like COM [12], all classes and interfaces implicitly 
inherit from a common base interface that provides reference counting 
and simple query interface capabilities. 

Based on suggestions from our users, we have augmented the Java 
inheritance syntax with an implements-all keyword, which declares 
that the associated class implements all of the methods in the specified 
interface. This keyword is equivalent to using the implements keyword 
and repeating the definition of all interface methods in the class body. 
The implements-all shorthand is cleaner and more closely reflects the 
way many of our users think about designing scientific libraries. They 
typically define abstract interfaces that describe the desired functionality 
and then combine those interfaces together into classes and components 
that implement that functionality. 
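
For readers more familiar with C++, the inheritance model can be 
approximated as in the sketch below. It illustrates the rules only and 
is not the code that Babel generates: all class names are hypothetical, 
interfaces are modeled as abstract classes that inherit virtually from a 
common base, and the liveObjects counter exists solely so that the 
effect of reference counting can be observed. 

```cpp
#include <cassert>
#include <cstring>

// Hypothetical common base interface: reference counting and a type query.
class BaseInterface {
public:
    virtual void addReference() = 0;
    virtual void deleteReference() = 0;
    virtual bool isType(const char* name) const = 0;
protected:
    virtual ~BaseInterface() = default;
};

// Interfaces: abstract methods only, so inheriting several is unambiguous.
class Sized : public virtual BaseInterface {
public:
    virtual int getGlobalDimension() const = 0;
};

class Scalable : public virtual BaseInterface {
public:
    virtual void scale(double a) = 0;
};

// The single implementation class that a concrete class extends.
class RefCounted : public virtual BaseInterface {
public:
    void addReference() override { ++count_; }
    void deleteReference() override {
        assert(count_ > 0);
        if (--count_ == 0) delete this;
    }
protected:
    ~RefCounted() override = default;
private:
    int count_ = 1;  // the creator holds the first reference
};

// Extends one class (RefCounted) and implements two interfaces.
class ConstantVector : public RefCounted, public Sized, public Scalable {
public:
    ConstantVector(int n, double value) : n_(n), value_(value) {
        ++liveObjects;
    }
    int getGlobalDimension() const override { return n_; }
    void scale(double a) override { value_ *= a; }
    double value() const { return value_; }
    bool isType(const char* name) const override {
        return std::strcmp(name, "Sized") == 0 ||
               std::strcmp(name, "Scalable") == 0 ||
               std::strcmp(name, "ConstantVector") == 0;
    }
    static int liveObjects;  // for observation only
protected:
    ~ConstantVector() override { --liveObjects; }
private:
    int n_;
    double value_;
};

int ConstantVector::liveObjects = 0;
```

Because only RefCounted carries an implementation of the reference 
counting methods, there is exactly one final overrider for each, which 
is the property the single implementation inheritance rule is meant to 
guarantee. 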

3.1.4 Arrays. SIDL supports the style of dynamically-sized, 
dense, multi-dimensional arrays that are common in scientific applica- 
tions. Existing IDLs such as CORBA [19] support only dynamically- 
sized, one-dimensional arrays (a CORBA sequence) and statically-sized, 
multi-dimensional arrays. Dense arrays consist of one physical segment 
of memory that can be accessed efficiently by an optimizing compiler. 
Such arrays are common in the scientific community due to its Fortran 
heritage and because dense arrays offer better access performance than 
“array of array” implementations. 

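The difference from an “array of array” layout can be seen in a short 
sketch of a dynamically sized dense array. The class DenseArray is 
hypothetical and is not the SIDL array binding; column-major (Fortran) 
ordering and the element type double are assumptions made for the 
example. 

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

class DenseArray {
public:
    // The extents are chosen at run time; all elements share one segment.
    explicit DenseArray(std::vector<std::size_t> extents)
        : extents_(std::move(extents)),
          data_(std::accumulate(extents_.begin(), extents_.end(),
                                std::size_t{1},
                                std::multiplies<std::size_t>()),
                0.0) {}

    // Column-major (Fortran-order) offset: the first index varies fastest.
    std::size_t offset(const std::vector<std::size_t>& index) const {
        assert(index.size() == extents_.size());
        std::size_t off = 0;
        std::size_t stride = 1;
        for (std::size_t d = 0; d < extents_.size(); ++d) {
            assert(index[d] < extents_[d]);
            off += index[d] * stride;
            stride *= extents_[d];
        }
        return off;
    }

    double& at(const std::vector<std::size_t>& index) {
        return data_[offset(index)];
    }

    // Pointer to the single contiguous block, e.g. to pass to a library.
    double* data() { return data_.data(); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<std::size_t> extents_;  // declared first: sizes data_
    std::vector<double> data_;
};
```

Every element is reached by one multiply-add chain into a single block 
of memory, whereas an array of arrays needs one pointer dereference per 
dimension and gives no guarantee that rows are adjacent. 
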
3.1.5 Parallelization Support. We have just begun to de- 
velop support for parallel data redistribution in the Babel tools. There- 
fore, the following discussion should be considered preliminary, although 
it does indicate our basic approach. SIDL currently supports parallel 
communication directives that describe method behavior in a parallel 
execution environment. For example, the local method modifier in 
interface Vector of Figure 1 indicates that the getLocalDimension method 
is valid only when invoked on an object in the same memory address 
space. For this method, the number of local vector elements owned by 
a particular processor has no meaning for a Vector object distributed 
across a different set of processors. 

Unlike PARDIS [15] and Cobra [22], we do not intend to add data 
distribution directives to the SIDL language. We do not believe that 
static IDL data distribution directives will be sufficient to describe the 
dynamic complexity and wide range of parallel objects used in scientific 
computing. Instead, we plan to use run-time data descriptions of data 
objects. Distributed parallel objects will be required to support one of 
a set of data distribution interfaces through which the object describes 
its internal data distribution state. The Babel run-time will use that 
information to manage data redistribution during method invocations. 
We feel this approach is more appropriate for sophisticated data decom- 
positions that change during the course of a simulation. 
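
The run-time approach can be illustrated with a small sketch. It assumes a hypothetical descriptor class, BlockDistribution, through which an object reports which contiguous block of a one-dimensional index space each processor owns; neither the class nor the function below is part of Babel, whose distribution interfaces are not specified in the text. Given two such descriptors, a run-time system could derive the transfers needed to redistribute the data.

```python
class BlockDistribution:
    """Hypothetical run-time descriptor: contiguous blocks of a 1-D index space."""

    def __init__(self, global_size, num_ranks):
        self.global_size = global_size
        self.num_ranks = num_ranks

    def owned_range(self, rank):
        # Half-open range [lo, hi) of global indices owned by `rank`.
        base, extra = divmod(self.global_size, self.num_ranks)
        lo = rank * base + min(rank, extra)
        hi = lo + base + (1 if rank < extra else 0)
        return lo, hi


def redistribution_plan(src, dst):
    """List (src_rank, dst_rank, lo, hi) transfers that move data from src to dst."""
    if src.global_size != dst.global_size:
        raise ValueError("distributions describe different index spaces")
    plan = []
    for s in range(src.num_ranks):
        s_lo, s_hi = src.owned_range(s)
        for d in range(dst.num_ranks):
            d_lo, d_hi = dst.owned_range(d)
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:
                plan.append((s, d, lo, hi))
    return plan


plan = redistribution_plan(BlockDistribution(10, 2), BlockDistribution(10, 3))
```

Because the descriptors are ordinary objects queried at call time, a decomposition that changes during a simulation only has to report its new state; nothing in a static interface definition needs to change.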

3.2. BABEL TOOL ARCHITECTURE 

The Babel tool suite consists of a number of separate pieces: a 
SIDL parser, a code generator, a small run-time support library, and 
the Alexandria component repository. Currently, Babel supports 
Fortran 77, C, and C++; we plan to develop support for Java, Python, 
Fortran 90, and MATLAB in the following year. 

The Babel parser, which is available either at the command-line or 
through the Alexandria web interface, reads SIDL interface specifica- 
tions and generates an intermediate XML [28] representation. XML is 
a useful intermediate language since it is amenable to manipulation by 
tools such as a repository or a problem solving environment. XML in- 
terface descriptions are stored either in a local file repository or on the 
web using Alexandria. The vision is that a scientist downloading a 
particular software library from the Alexandria component repository 
will receive not only that library but also the required language bindings 
generated automatically by the Babel tools. 

The Babel code generator reads SIDL XML descriptions and auto- 
matically generates glue code for the specified software library. This glue 
code mediates differences among calling languages and supports efficient 
inter-language calls within the same memory address space and, eventu- 
ally, across memory spaces for distributed objects. The code generators 
create four different types of files: stubs, skeletons, Babel internal 
representation, and implementation prototypes. The Babel internal object 
representation created by the code generators is similar to that used 
by COM [12], CORBA’s Portable Object Adaptor [19], and scientific 
libraries such as PETSc [2]. The internal object representation is essen- 
tially a table of function pointers, one for each method in an object’s in- 
terface, along with other information such as internal object state data, 
parent classes and interfaces, and Babel data structures. Stub and 
skeleton code translates between the calling conventions of a particular 
language and the internal Babel representation. The code generators 
also create implementation files that contain function prototypes to be 
filled in by the library developers. To simplify the task of library writ- 
ers, we have added automatic Makefile generation as well as a “code 
splicing” capability that preserves old edits during the regeneration of 
implementation files after modifications to the SIDL source. 
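
The idea of an object represented as a table of function pointers can be sketched as follows. This is only an illustration under stated assumptions: the names ObjectRep, make_vector, and call are hypothetical, and the code does not reproduce the data structures that Babel actually generates.

```python
class ObjectRep:
    """Hypothetical internal representation: a method table plus object state."""

    def __init__(self, method_table, state):
        self.methods = method_table  # one entry per interface method
        self.state = state           # private data of the implementation


# Skeleton side: implementation functions, as a library developer would fill in.
def vector_norm1_impl(state):
    return sum(abs(x) for x in state["values"])


def vector_scale_impl(state, factor):
    state["values"] = [factor * x for x in state["values"]]


def make_vector(values):
    table = {"norm1": vector_norm1_impl, "scale": vector_scale_impl}
    return ObjectRep(table, {"values": list(values)})


# Stub side: what a caller invokes; it only looks up and calls table entries.
def call(obj, method, *args):
    return obj.methods[method](obj.state, *args)


v = make_vector([1.0, -2.0, 3.0])
call(v, "scale", 2.0)
result = call(v, "norm1")
```

The caller never refers to an implementation function by name, which is what allows the stub and the skeleton to be written in different languages.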

3.3. TECHNOLOGY DEMONSTRATION IN 
HYPRE 

In collaboration with members of the hypre development team, we 
have integrated some of the Babel language interoperability technology 
into hypre [9]. The hypre library is a suite of parallel scalable linear 
solvers and preconditioners implemented in C with MPI. There were 
four primary goals of this collaboration. First, the Babel team wished 
to demonstrate the technology and get feedback from library develop- 
ers. Second, the hypre project needed automatically generated Fortran 
bindings that would track changes in the library. Previously, a num- 
ber of different Fortran bindings were developed by various users but 
fell into obsolescence with new changes to the hypre source. Third, the 
hypre team wanted to explore new design options using object-oriented 
and component-based software techniques, but the team had no desire 
to generate and support the necessary object-oriented infrastructure by 
hand. Finally, hypre developers wanted to integrate software developed 
by other groups who had written code in C++ and Fortran. 

The collaboration began by identifying key parts of hypre and devel- 
oping an object-oriented design in SIDL for the primary hypre objects. 
For the most part, existing hypre implementations were wrapped using 
glue code generated by the Babel tools. In spite of this additional in- 
termediate glue code, parallel runs with both Fortran and C drivers 
indicate that Babel overheads are too small to measure accurately. 

The developers of hypre identified a number of advantages to using 
Babel for their scientific software library in addition to the obvious ad- 
vantage of language interoperability. Developers found that SIDL was a 
convenient specification description language for the design of scientific 
libraries because it eliminated unnecessary implementation details and 
forced them to focus on the object-oriented design of the library. They 
felt that SIDL was relatively easy to master, although some were new 
to object-oriented design and object-oriented languages. Furthermore, 
hypre developers noticed that they could eliminate redundant code by 
taking advantage of polymorphism. For example, the previous hypre 
library contained four different preconditioned conjugate gradient 
routines, each written for a particular type of preconditioner data structure. 
Through the use of polymorphism enabled by Babel, they were able to 
reduce the number of routines to one. Finally, the hypre developers were 
able to exploit object-oriented design in C, which has no object-oriented 
support, using the automatically generated Babel code. 
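
The reduction from several solver routines to one can be illustrated with a short sketch. It is not hypre code: the two preconditioner classes and the dense list-of-lists matrix are hypothetical stand-ins, chosen only to show a single preconditioned conjugate gradient routine that works with any object providing an apply() method.

```python
class JacobiPreconditioner:
    """Hypothetical preconditioner: divides by the matrix diagonal."""

    def __init__(self, matrix):
        self.inv_diag = [1.0 / matrix[i][i] for i in range(len(matrix))]

    def apply(self, r):
        return [d * ri for d, ri in zip(self.inv_diag, r)]


class IdentityPreconditioner:
    """Hypothetical preconditioner that leaves the residual unchanged."""

    def apply(self, r):
        return list(r)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(matrix, x):
    return [dot(row, x) for row in matrix]


def pcg(matrix, b, precond, tol=1e-10, max_iter=100):
    """One conjugate gradient routine for any object with an apply() method."""
    x = [0.0] * len(b)
    r = list(b)
    z = precond.apply(r)
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = matvec(matrix, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = precond.apply(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_jacobi = pcg(A, b, JacobiPreconditioner(A))
x_plain = pcg(A, b, IdentityPreconditioner())
```

Both calls use the same pcg routine; only the object passed as precond differs.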

4. THE ALEXANDRIA REPOSITORY 

The Alexandria repository was designed and built to facilitate the 
adoption of component technology for high-performance scientific simu- 
lation software. Our goal was to provide a network service where com- 
ponent developers can publish their software and interface definitions 
and where application developers can find and download components 
and the associated language bindings. The system was intended to have 
a user interface to support human and machine clients. 

Alexandria provides a hierarchically organized collection of software 
packages uploaded by component developers, a fuzzy search capability, 
an interface definition browser, and a web user interface to the Ba- 
bel language interoperability tool. For machine clients, Alexandria 
provides a repository of XML interface definitions and will hold a repos- 
itory of shared libraries which implement particular interfaces to enable 
dynamic graphical application builders or other development tools. 

We chose to implement a web application (i.e., a web server with 
dynamic content managed by a program) to achieve these goals and fea- 
tures. A web application can provide a sophisticated and friendly user 
interface designed for human clients and a simple, feature-rich interface 
for machine clients. By using web technologies, we make the repository’s 
services available to the largest possible network audience; any contem- 
porary web browser can access Alexandria. Machine clients can use 
standard network libraries to access the repository. Other network ap- 
proaches would require installation of special purpose clients or more 
elaborate machine clients thereby decreasing the potential audience for 
the service. The HTTP protocol provides all the transaction types nec- 
essary for the repository: uploading files and other information from a 
user interface form and downloading content. The transactional nature 
of the web makes the user interface less interactive than a native ap- 
plication, but the benefits of the web interface seem to outweigh this 
deficiency. 

As shown in Figure 2, Alexandria uses a three-tiered architecture: 
a web browser based user interface, a web server with Java servlets [11] 
and JavaServer Pages [21], and a JDBC [26] connection to an SQL back- 
end. The web server delegates HTTP messages for certain URLs to Java 
servlets, and the servlet provides the content or an error response. A 
servlet is a Java class that implements a standard interface or overrides 
methods inherited from a standard base class. The servlet can use all 
the features of the Java platform in generating its response. JavaServer 
Pages is a convenient, dynamic way to generate a servlet which usu- 
ally combines HTML with embedded Java code to provide the dynamic 
content. 

The Alexandria application consists of five subsystems: an access 
control system, an inexact string matching package, a hierarchy 
management system, a content scanning package, and an interface to Babel. The 
access control system manages user accounts and provides several differ- 
ent levels of access to the system: administrator, trusted user, normal 
user and world. The inexact string matching package is a Java imple- 
mentation of the algorithm from agrep [30]. 
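
Inexact matching of a search term against a longer text can be sketched as below. Note the assumption: this is not the bit-parallel algorithm used by agrep, but a simpler dynamic-programming formulation that answers the same question, namely whether the pattern occurs somewhere in the text with at most a given number of edits. The function names are hypothetical.

```python
def best_match_distance(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # prev[i] = distance between pattern[:i] and the best substring ending
    # at the previous text position. A match may start anywhere, so the
    # cost of matching an empty pattern prefix is always 0.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]
        for i, pch in enumerate(pattern, start=1):
            cost = 0 if pch == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitute
                            prev[i] + 1,         # extra character in text
                            curr[i - 1] + 1))    # pattern character missing
        best = min(best, curr[-1])
        prev = curr
    return best


def fuzzy_contains(pattern, text, max_errors=1):
    return best_match_distance(pattern, text) <= max_errors


found = fuzzy_contains("solver", "parallel linear solvr library")
```

With one allowed error, the misspelled "solvr" still matches the query "solver".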

The hierarchy management system provides cataloging, uploading and 
downloading features. Unlike a normal file system, the hierarchy can 
hold files with the same name in a common directory as long as they 
have different version numbers. The expectation is that over time a 
project will issue multiple versions of individual files. 

The content scanning package checks material provided by users to 
see if it is “safe content.” A responsible web server that receives content 
from users and then presents that content back to other users must verify 
that the user provided material does not contain hostile scripts. Rather 
than trying to characterize and detect hostile content, Alexandria tests 
user provided content against an XML DTD that contains a safe subset 
of XHTML 1.0 [27]. A validating XML parser is used to determine if 
user provided content is safe. If the material does not validate, all the 
mark-up directives are transformed so they will be interpreted as plain 
text rather than as mark-up directives. 
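
The accept-or-neutralize policy can be sketched as follows. One assumption should be stated clearly: the Python standard library has no validating XML parser, so this sketch replaces DTD validation with a check for well-formedness plus a hypothetical allow-list of tags (SAFE_TAGS). It illustrates the policy only and is not the Alexandria implementation.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical allow-list standing in for the "safe subset" DTD.
SAFE_TAGS = {"div", "p", "b", "i", "em", "strong", "ul", "ol", "li", "pre", "code"}


def is_safe(fragment):
    """True if the fragment is well-formed and uses only allow-listed tags."""
    try:
        root = ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError:
        return False
    for element in root.iter():
        # No attributes are allowed in this sketch, which rules out
        # event handlers such as onclick and javascript: links.
        if element.tag not in SAFE_TAGS or element.attrib:
            return False
    return True


def sanitize(fragment):
    """Return safe content unchanged; otherwise neutralize all mark-up."""
    if is_safe(fragment):
        return fragment
    return html.escape(fragment)


kept = sanitize("<p>A <b>fast</b> solver</p>")
neutralized = sanitize("<script>alert(1)</script>")
```

Testing against a description of what is allowed, rather than searching for what is hostile, means that unanticipated mark-up is rejected by default.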

The interface to Babel subsystem provides language bindings for a 
SIDL file to users. The user’s SIDL file is uploaded to the web server, 
the web server runs Babel on the file, the results are packaged in a TAR 
file, and then the user is given the chance to download the file. This 
saves users from having to install Babel and a Java virtual machine on 
their local machine. 

Alexandria maintains a repository of XML type information. Users 
with sufficient access can translate the SIDL file into an equivalent XML 
representation and upload the XML representation to the repository. 
Once it is in the repository, anyone running Babel can use the XML 
information from Alexandria rather than having to explicitly download 
all the needed SIDL files. In addition, the web server provides high 
quality interface documentation to web browsers by applying XSLT [29], 
an evolving standard for translating XML into HTML or other markup 
languages. 

5. CONCLUSIONS 

In this paper, we have described two pieces of a component technology 
architecture for scientific computing. Babel is a language interoperabil- 
ity tool that uses the SIDL interface description language to describe 
component interfaces and to generate code that mediates differences 
between programming languages. Alexandria is a web-enabled compo- 
nent repository that provides a browsable software library, automated 
access to SIDL type information, and web access to the Babel code 
generators. 

Obviously, much work remains in developing production-quality com- 
ponent technology for the scientific computing community. Members of 
the Common Component Architecture working group have made some 
initial progress in this direction and have drafted a proposal that covers 
common behavior standards for components [1]. A number of interest- 
ing open research questions remain in extending current parallel data 
redistribution approaches [5, 15, 16, 22] to arbitrary data components. 

DISCUSSION 

Speaker: Scott Kohn 

Thierry Priol : I do not understand why the data distribution specifi- 
cation is not exposed in the IDL associated with a parallel component. 
Scott Kohn : One of the goals of our work is to support the redistri- 
bution of very complicated scientific data objects, such as unstructured 
meshes, hierarchical adaptive mesh refinement structures, various matrix 
representations, and so on. We are not planning to build data distribu- 
tion specifications into the IDL because, at least at this time, we do 
not understand how to represent these diverse data decompositions in a 
static IDL description. To our knowledge, the only work in this area has 
focused on array structures. Another issue is that the IDL description 
is static, whereas data decompositions often change during the course of 
a simulation. We plan to concentrate on run-time descriptions of data 
objects. For example, a distributed parallel object will be required to 
support one of a set of data distribution interfaces through which the ob- 
ject describes its data distribution state. We feel this approach is more 
appropriate for sophisticated data decompositions that change during 
the course of a computation. 

Michael Thuné : With regard to your conclusion, one could ask: can 
we do without component technology? What would be the alternative? 
Scott Kohn : I think some form of component technology will be 
necessary, whether it is scripting or some other style of integration ap- 
proach. I am simply not sure that our particular design choices are the 
correct ones. For example, how important is language interoperability? 
Is it sufficient to support one scripting language and one compiled lan- 
guage? If language interoperability is important, should we use an IDL 
approach? Should the IDL express parallelization and redistribution for 
complex data objects? I believe that there is still a lot of exploration 
and research to be done by this community. 

Richard Fateman : Regarding alternatives to component technology: 
Monolithic systems such as Lisp machines (built at various times by Xe- 
rox, Texas Instruments, Symbolics, and LMI) provided access to all as- 
pects of a computing environment: operating system, networking, com- 
pilers, memory management, object representation, visualization. There 
are major advantages to such an approach as shown by the impressive 
productivity of these systems when used by skilled programmers. In- 
adequate languages force system builders to deal with inter-language 
communication and many associated complexities — typically poorly as 
when error indicators are unchecked at interfaces, memory management 
is inconsistent, and data must be repeatedly rearranged and reformatted. 
Scott Kohn : I agree that choice of language and the programming 
environment can significantly impact productivity. I question whether 
the scientific software community would adopt a single environment or 
a single language. In some sense, limiting ourselves to only one lan- 
guage would be a bad choice in that it would limit exploration. We 
use many different languages because each language offers different ad- 
vantages. Fortran, in spite of all of its limitations, is a very good lan- 
guage for array manipulation. C++ offers object-oriented capabilities 
at a reasonable cost in terms of performance. Java is a better 
object-oriented language, but performance is not as good as C++. Python 
provides scripting capabilities. I don’t see any single language as an 
overall solution. Component technology is a very pragmatic solution to 
the integration of diverse languages and environments. 

John R. Rice : Suppose everyone agreed to use a single language 
forever more. How would this eliminate the need for a component tech- 
nology? I think it would still be essential. 

Scott Kohn : I agree, although I think the need for component tech- 
nology would be diminished. For example, performance considerations 
aside, Java and Python are very good programming languages that share 
many characteristics of a good component system: physical deployment 
and packaging standards, dynamic loading, good support for 
abstraction, interface metadata, and common object behaviors. In the scientific 
domain, I think components also offer advantages for distributed com- 
puting and parallel data communication between components. To be 
pragmatic, however, technology is always changing, and the community 
would not want to choose a single language forever more. We need an 
integration approach such as components that will adapt to the changing 
technology landscape. 

and numerical solution methods are left intact. We report on some pre- 
demonstrate the feasibility of this approach. 

1. INTRODUCTION 

It has become increasingly clear in the large-scale simulation commu- 
nity that developing efficient programs for complex simulations on large 
parallel computers is a laborious and difficult task. Therefore, several 
frameworks have been developed to ease the implementation of large- 
scale, parallel, simulations, such as POOMA [3], Overture [7], SAM- 
RAI [12], ALEGRA [6], ALICE [1], and SIERRA [22]. These frame- 
works generally blend innovations in computational techniques with 
innovations in software technology; however, they typically focus on 
only a few techniques and applications. Since these frameworks simplify the imple- 
mentation of parallel applications, it was assumed that these frameworks 
would be the right platforms for implementing the combined simulations 
of physical processes. However, although the efforts of constructing these 
frameworks in themselves have been successful, none of these frameworks 
has attracted a large user base or been widely adopted outside their field 
of application. A major concern of the application community is the re- 
quired complete conversion of their software to make use of one of these 
frameworks, which means rewriting, loss of control over many aspects 
of their software, and the code’s resulting dependence on the existence 
of and continued support for the framework. Moreover, no framework 
supports the wide variety of discretization schemes and numerical tech- 
niques that exist, and combining codes from different frameworks is still 
hard. 

The Common Component Architecture (CCA) [2] and initiatives like 
PAWS [4] have been started to address this problem. Most research 
focuses on either the general aspects of components (CCA) or the devel- 
opment of very specific components, such as linear solvers (ESI) [8] and 
their interface to finite element programs (FEI) [10]. 

We aim to apply component principles at the level of whole applica- 
tions, so that parallel applications can run both stand-alone and with 
other applications in our programming environment. The changes to 
existing parallel programs should remain minimal. 

In this paper, we will describe our first experiments with this 
approach using the Charm++ [14] parallel runtime system in the Center 
for Simulation of Advanced Rockets. 

We are building an integrated multi-physics Rocket Code from sev- 
eral stand-alone pieces, including a computational fluid dynamics code 
(ROCFLO), and a structural analysis code (ROCSOLID). These codes 
are tied together using an interface code (ROCFACE). Jointly these 
codes are used to simulate solid propellant rockets. ROCFLO solves 
the equations describing the core flow in the inner part of the rocket. 
At present, ROCFLO also models the combustion of the propellant, while 
a full 3D combustion code is being developed. ROCSOLID solves the 
equations describing the movement (deformation) of the solid propellant, 
liner, and casing. ROCFACE takes care of the transfer of data and the 
necessary (conservative) interpolation of physical values (temperature, 
pressure, and displacement) between these two applications. Since the 
two applications were originally constructed independently for different 
purposes, they contain disparate meshes. The transfer of data between 
these meshes involves finding a matching between adjacent elements on 
the two meshes, and determining the functions that need to be used to 
interpolate values between those elements. 

In the future, we hope to be able to integrate additional codes to 
handle specific situations within the rocket. A 3D combustion code may 
be added to the system. A crack propagation code will be used to model 
cracks that sometimes open up within the solid propellant. These are 
important because if a pressurized crack reaches the rocket casing, the 
combustion can burn through the case, causing the rocket to explode. 
A code simulating the paths of particles of aluminum, released from 
the solid propellant, which enter and burn within the core flow of the 
rocket, is being prototyped. Other capabilities may be implemented as 
stand-alone codes or additions to existing code, such as models of the 
mechanical joints in the rocket case, turbulence within the core flow, and 
the ablation of the rocket nozzle. 

Although our approach seems to be an effective way to solve highly 
complex problems, separate the concerns of simulations of different 
physical processes, and preserve past effort in developing simulation pro- 
grams, we know of no group that has pursued it as a systematic ap- 
proach. That is, to our knowledge, no programming environment has 
been developed, to date, that couples, with minimal changes, existing 
stand-alone applications. 

Several groups have worked on the mathematical and physical aspects 
of simulating two interacting processes (such as fluid-structure interac- 
tion) by simulating each part separately and solving the combined prob- 
lem by forcing consistent solutions on the interfaces. This is referred 
to as a partitioned solution procedure [9]. However, this has only been 
done for specific and special problems, and the main results are specific, 
detailed schemes for those problems. Probably these schemes can be 
extended to other problems, and this has been pointed out [19]. How- 
ever, again it seems a general programming environment that supports 
combining multiple grid-based applications for complex multi-physics 
simulations, especially involving highly dynamic adaptive simulations, 
has not been studied or built. 

1.1. THE INTEGRATION OF NUMERICAL 
METHODS 

The efforts in the CCA and other component projects aim at exchang- 
ing data without any semantics (meaning) associated with the data. 
These projects address flexible, but raw, data transfer. The aim of these 
standards is to be able to carry out the exchange of data between as wide 
a class of components as possible by setting standards for generic 
mechanisms and by encouraging components to support multiple formats for 
data exchange. 

Although this is important and should greatly enhance re-usability 
and inter-operability, it is not enough to really ease the composition of 
multiple numerical codes that were not developed together or that were 
developed with a different use in mind. Moreover, in order to reach true plug-and-play 
for numerical components, as advocated in [2], especially without the 
need to bring all specialists together for each extension of the simulation 
code, we need an environment that is aware of the interaction between 
numerical methods. 

On the other hand, existing component projects which do deal with 
the semantics of data at the interface, such as the FEI (Finite Element 
Interface), are very limited in scope (and they do not share a strict 
definition of components with the CCA group). They address only one 
important coupling: finite element codes with linear solvers. 

2. A NEW TYPE OF INFRASTRUCTURE 
FOR COUPLED SIMULATIONS 

The emphasis in this paper is on an infrastructure to facilitate the 
implementation of multi-physics simulations. Addressing the special re- 
quirements of coupled applications significantly complicates the imple- 
mentation, and we briefly outline the most important functionality here. 
We divide the functionality into three major categories: 

■ Application coupling technology, 

■ Orchestration of multiple independent simulation programs with 
highly dynamic behavior, 

■ Computational Steering (physical model, mathematical model, 
and performance). 

Without going into great detail, we would like to outline here some of 
the most important issues for partitioned, coupled simulations. 

2.1. APPLICATION COUPLING 
TECHNOLOGY 

The coupling of applications in a partitioned simulation involves four 
major issues: mesh matching, physically and mathematically consistent 
mapping of boundary data, the coupling of the separate solution pro- 
cesses in each application, and the coordination of the separate time- 
stepping procedures. 

Generally, the constituent applications come with their own meshes, 
discretizations, and internal data structures. As the meshes may differ 
in type from application to application, they may not be aligned or 
even coincide at the physical interface. Moreover, these meshes will 
typically have been partitioned independently for parallelization. In 
addition, they may differ significantly in spatial resolution. As a result, 
the transfer of data in a physically and mathematically correct way is 
very complicated. 

We need to compute a matching between the meshes of interacting 
applications. This matching will indicate the interaction between in- 
terdependent applications at the level of individual elements or cells. 
Together with the equations that must be satisfied on the boundary, 
constraints on the mapped values, and possible conditions to be satis- 
fied, the results of the mesh matching will define the mapping of variables 
between applications [18, 13, 15, 19]. 
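
For intuition, mesh matching can be sketched in one dimension, where each mesh is a sorted list of node coordinates and a cell is the interval between two consecutive nodes. This is a simplified illustration with hypothetical names; matching surface meshes in three dimensions, as discussed in the cited work, is considerably more involved.

```python
def match_cells(nodes_a, nodes_b):
    """Overlaps between two 1-D meshes given by sorted node coordinates.

    Returns (cell_a, cell_b, overlap_length) for every pair of cells
    that share an interval of positive length.
    """
    matches = []
    i = j = 0
    while i < len(nodes_a) - 1 and j < len(nodes_b) - 1:
        lo = max(nodes_a[i], nodes_b[j])
        hi = min(nodes_a[i + 1], nodes_b[j + 1])
        if hi > lo:
            matches.append((i, j, hi - lo))
        # Advance whichever cell ends first.
        if nodes_a[i + 1] <= nodes_b[j + 1]:
            i += 1
        else:
            j += 1
    return matches


fluid_nodes = [0.0, 0.5, 1.0]
solid_nodes = [0.0, 0.25, 0.75, 1.0]
matching = match_cells(fluid_nodes, solid_nodes)
```

Each entry records which cells interact and over what length, which is the element-level information a data mapping would later consume.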

After identifying which parts of separate meshes correspond, we need 
to define mappings of those variables that are needed to make the math- 
ematical model in the neighbor application well-defined. The physics of 
the underlying application or mathematical stability constraints may re- 
quire that certain functions of mapped variable fields be conserved. This 
can be relatively simple as in the conservation of a single variable like 
mass, or it can be more complicated as in the conservation of an inte- 
gral over a function of multiple variables, such as work expressed as the 
product of displacement times force, and momentum expressed as mass 
times velocity [5, 9]. However, even the conservation of a single variable 
may not be trivial if we map variables between meshes that differ in 
type of discretization (finite element versus finite volume), element type 
(tetrahedral vs hexahedral), finite element basis functions, or if meshes 
vary widely in spatial resolution (note that there may be accuracy con- 
straints). In addition, some variables may have specific constraints. For 
example, a mass value would be required to be non-negative. 
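
The simplest case, conservation of a single variable, can be sketched in one dimension. The sketch assumes cell-averaged densities, two meshes that cover the same interval, and hypothetical function names; it weights each source value by the length of its overlap with the target cell, so total mass is preserved and non-negative input stays non-negative.

```python
def conservative_transfer(src_nodes, src_density, dst_nodes):
    """Map cell-averaged densities between 1-D meshes, conserving total mass."""
    dst_density = []
    for j in range(len(dst_nodes) - 1):
        d_lo, d_hi = dst_nodes[j], dst_nodes[j + 1]
        mass = 0.0
        for i in range(len(src_nodes) - 1):
            overlap = min(d_hi, src_nodes[i + 1]) - max(d_lo, src_nodes[i])
            if overlap > 0.0:
                mass += src_density[i] * overlap  # mass carried by the overlap
        dst_density.append(mass / (d_hi - d_lo))
    return dst_density


def total_mass(nodes, density):
    return sum(rho * (nodes[k + 1] - nodes[k]) for k, rho in enumerate(density))


src_nodes = [0.0, 0.5, 1.0]
src_density = [2.0, 4.0]
dst_nodes = [0.0, 0.25, 0.75, 1.0]
dst_density = conservative_transfer(src_nodes, src_density, dst_nodes)
```

A pointwise interpolation of the same data would generally not reproduce the total mass, which is why the overlap weights matter.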

In the partitioned approach we run each application separately, and 
they interact through their boundaries. If we require at each (major) 
time step that values on a shared boundary be consistent among appli- 
cations, then we call such an approach strongly coupled [9]. In this case 
we have to decide on the tolerances on the convergence across applica- 
tions. If we do not require this consistency we call the approach loosely 
coupled [20]. We envision using both approaches. These choices have 
an influence on the overall solution accuracy and efficiency. A potential 
problem that arises in the strongly coupled procedure is divergence or 
stagnation of convergence across multiple applications. Even though in 
each step all applications converge, their shared representation of the 
(fields of) variables on one or more boundaries may not. 
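
The difference between the two approaches can be sketched with two toy "applications" that exchange a single interface value. The functions fluid_solve and solid_solve are hypothetical linear stand-ins for real solvers, and the tolerance and sweep limit are arbitrary choices made for illustration.

```python
def fluid_solve(displacement):
    # Hypothetical stand-in for a flow solver: interface pressure
    # as a function of the interface displacement.
    return 1.0 - 0.5 * displacement


def solid_solve(pressure):
    # Hypothetical stand-in for a structural solver: interface
    # displacement as a function of the applied pressure.
    return 0.4 * pressure


def strongly_coupled_step(displacement, tol=1e-10, max_sweeps=50):
    """Iterate both solvers until the interface value stops changing."""
    for sweep in range(1, max_sweeps + 1):
        pressure = fluid_solve(displacement)
        new_displacement = solid_solve(pressure)
        change = abs(new_displacement - displacement)
        displacement = new_displacement
        if change < tol:
            return displacement, pressure, sweep
    raise RuntimeError("interface iteration stagnated or diverged")


def loosely_coupled_step(displacement):
    """One exchange per time step, with no consistency check."""
    pressure = fluid_solve(displacement)
    return solid_solve(pressure), pressure


d_strong, p_strong, sweeps = strongly_coupled_step(0.0)
d_loose, p_loose = loosely_coupled_step(0.0)
```

In this toy problem the strongly coupled iteration contracts quickly; the RuntimeError branch marks the divergence or stagnation case described above, which real coupled solvers may encounter even when each solver converges on its own.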

The issues involved in setting time steps are closely linked with or- 
chestration, but there are also independent issues. We need a sequence 
of global (maximum) time steps at which each application must deliver 
a solution. In the case of a strongly coupled approach, we require consis- 
tency at these global time steps. Within the limits of a global time step, 
each application can choose its own time steps. The global time steps 
will generally be determined by accuracy constraints, the time-scale of 
relevant physical phenomena, or a global CFL-like condition. 
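
A minimal sketch of this scheme follows, assuming each application has a preferred step size and must land exactly on every global time. The names substeps and build_schedule are hypothetical, and the step sizes are illustrative only.

```python
def substeps(start, end, preferred_dt):
    """Step sizes one application takes to advance from `start` to `end`.

    The final step is shortened so the application lands on `end`.
    """
    steps = []
    t = start
    while end - t > 1e-12:
        dt = min(preferred_dt, end - t)
        steps.append(dt)
        t += dt
    return steps


def build_schedule(global_times, app_dts):
    """For each global step, list the sub-steps every application takes."""
    schedule = []
    for t0, t1 in zip(global_times, global_times[1:]):
        schedule.append({name: substeps(t0, t1, dt)
                         for name, dt in app_dts.items()})
        # At t1 all applications would exchange boundary data.
    return schedule


schedule = build_schedule([0.0, 1.0, 2.0], {"fluid": 0.25, "solid": 0.4})
```

Here the fluid code takes four sub-steps per global step and the solid code takes three, the last one shortened, before both deliver a solution at the shared time.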

2.2. ORCHESTRATION 

The orchestration of a collection of applications defines at a high level 
how the various simulations interact, how each does its time integration, 
and how these different time integration schemes are combined. Further- 
more, several problems can arise while running such a multi-application 
simulation in a partitioned approach. Convergence problems may occur, 
in a single application or across multiple applications. Moreover, the en- 
vironment may have to dynamically start additional applications, such 
as turbulence in a Computational Fluid Dynamics (CFD) application, 
or crack propagation in a structural application, and dynamically swap 
applications if required by the simulation. The programming environ- 
ment must orchestrate highly dynamic interactions of parallel partitions 
of applications, based on assumptions of applications, requirements on 
boundary exchanges, and convergence across applications. 

2.3. STEERING 

Since simulations may run for long periods of time on parallel super- 
computers, we must be able to interact intelligently with these simula- 
tions and maybe spin-off additional simulations derived from intermedi- 
ate results of the main simulation. It may also be necessary to adjust 
load balancing and parallelization schemes periodically, to optimize the 
performance while the application is running or to adjust these schemes 
to changes in the computational environment. Therefore, our program- 
ming environment needs to include steering, both from a mathemati- 
cal/physical point of view and from a computational/performance point 
of view [16, 21]. Given that these programs will run for very long times, 
it is unlikely that the results will be continuously monitored. Hence, 
we envision the use of smart, event-driven check-pointing. Based on a 
description of the relevant states of the simulated processes, or of the rel- 
evant dynamic behavior, the programming environment should be able 
to save the necessary data when triggered by the appropriate condition. 
The check-pointed data could be used later to run the simulation with 
higher precision or with more appropriate models. 
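
Event-driven check-pointing of this kind might look like the following sketch. The CheckpointManager class, the trigger name, and the state dictionary are all hypothetical; a real environment would write to stable storage rather than keep copies in memory.

```python
import copy


class CheckpointManager:
    """Hypothetical event-driven check-pointing: save state when a trigger fires."""

    def __init__(self):
        self.triggers = []     # (name, predicate) pairs
        self.checkpoints = []  # (step, trigger name, saved state)

    def add_trigger(self, name, predicate):
        self.triggers.append((name, predicate))

    def observe(self, step, state):
        for name, predicate in self.triggers:
            if predicate(state):
                # Deep copy so later updates do not alter the saved data.
                self.checkpoints.append((step, name, copy.deepcopy(state)))


manager = CheckpointManager()
manager.add_trigger("high_pressure", lambda s: s["max_pressure"] > 5.0)

state = {"max_pressure": 1.0}
for step in range(6):
    state["max_pressure"] += 1.0  # stand-in for one simulation step
    manager.observe(step, state)
```

Only the steps at which the condition holds are saved, so an unattended run keeps the states of interest without storing every step.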

Conceptually, a control channel would be kept open between the 
codes, facilitating orchestration and steering. This channel would allow 
a code to request services from, and report conditions to, the framework 
and other codes. This control channel would allow the framework to 
“call” the individual codes at the appropriate times, initiate data trans- 
fers, start codes in response to dynamic events, and handle exceptional 
situations. 

3. AUTOMATIC TRANSLATION INTO A 
FRAMEWORK 

The simulation codes should be independent, each capable of running 
by itself, but also able to cooperate with others when used within the 
framework. To avoid re-writing a code to fit into the framework, we pro- 
pose the use of automatic tools which could do the translation required 
to allow the code to fit into the framework. 

The translation of a given code would be guided by a Code Description 
(CD). The CD would essentially be an interface specification for the 
code, giving all the information necessary to 

■ locate data within the code which would be available for other 
codes, 

■ drive the conversion of data from one code to another, 

■ define the conditions placed on data passed to the code from the 
outside, and 

■ describe special functionality of the code which might be used un- 
der exceptional conditions. 

An Orchestration Description (OD) would be used to describe how the 
various component codes should interact. It would describe which mod- 
ules should exchange data, how to convert the data from one component 
to the other (including the matching of mesh points [13]), conditions un- 
der which certain modules should be invoked, a system-wide coordinate 
system for the various meshes, and any system-wide constraints or in- 
formation which spans components. Trigger conditions could be spelled 
out in this specification which show how unusual situations should be 
handled, when it is necessary to switch to a different simulation code, 
and when check-pointing should occur.

The CD and the OD could take the form of a set of annotations
interspersed with the statements of the simulation code itself, such as
the structured comments of OpenMP [17]. They could also take the form
of a separate description file, written in some language such as Python [23].
The structured comment approach has the advantage that a direct cor- 
respondence could be drawn between the description and the code itself. 
The separate description file has the advantage that it pulls all the pieces 
of the description together in one place. 
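
To make the idea of a separate description file more concrete, here is a minimal Python sketch of what a CD might contain. It simply mirrors the four kinds of information listed above. The class names, field names, and the example values (fluid_solver, wall_displacement, remesh) are hypothetical and are not taken from any existing tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExportedField:
    """Data inside the code that other codes may read."""
    name: str       # variable name inside the code
    units: str      # physical units of the stored values
    location: str   # where on the mesh the values live, e.g. "cell-center"


@dataclass
class CodeDescription:
    """Hypothetical Code Description (CD) for one simulation code."""
    code_name: str
    exports: List[ExportedField] = field(default_factory=list)
    # Multiplicative factors applied when data leaves this code.
    conversions: Dict[str, float] = field(default_factory=dict)
    # Conditions placed on data passed in from the outside.
    input_constraints: Dict[str, Callable[[float], bool]] = field(default_factory=dict)
    # Special functionality usable under exceptional conditions.
    special_entry_points: List[str] = field(default_factory=list)

    def accepts(self, name: str, value: float) -> bool:
        """Return True if an incoming value satisfies its declared constraint."""
        constraint = self.input_constraints.get(name)
        return True if constraint is None else constraint(value)


fluid_cd = CodeDescription(
    code_name="fluid_solver",
    exports=[ExportedField("pressure", "Pa", "cell-center")],
    conversions={"pressure": 1.0e-3},   # Pa -> kPa
    input_constraints={"wall_displacement": lambda d: abs(d) < 0.01},
    special_entry_points=["remesh"],
)
```

A translation tool could read such a description to decide where to place data ports and which incoming values to reject.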

The CD would guide the placement of data ports [2] within a given
code, allowing the code to communicate the values of certain variables
with other codes and the framework. The framework would employ
interpolation functions appropriate to the quantities and numerical
methods described in each code’s CD.

The OD would guide the automatic creation of a “driver” routine for 
the simulation, calling the individual components in the proper order and 
with the proper parameters. The driver routine would contain the proper 
convergence criteria, and tests for the trigger conditions, as specified in 
the OD. 
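
The kind of driver routine that could be generated from an OD can be sketched as follows. This is only an illustration under stated assumptions: the function make_driver, the two toy components, the convergence test on the largest change per sweep, and the trigger named high_pressure are all invented for the example.

```python
def make_driver(components, order, tolerance, max_iters, triggers):
    """Build a driver that sweeps the components in order until convergence."""
    def driver(state):
        fired = []
        for it in range(1, max_iters + 1):
            previous = dict(state)
            for name in order:
                state = components[name](state)
            # Test the trigger conditions after every sweep.
            for label, condition in triggers.items():
                if condition(state):
                    fired.append((it, label))
            # Converged when no coupled quantity changed by more than tolerance.
            change = max(abs(state[k] - previous[k]) for k in state)
            if change < tolerance:
                return state, it, fired
        return state, max_iters, fired
    return driver


def fluid(s):
    """Toy fluid component: pressure depends on the wall displacement."""
    return {**s, "p": 0.5 * s["u"] + 1.0}


def solid(s):
    """Toy solid component: displacement depends on the pressure."""
    return {**s, "u": 0.5 * s["p"]}


driver = make_driver(
    components={"fluid": fluid, "solid": solid},
    order=["fluid", "solid"],
    tolerance=1e-9,
    max_iters=100,
    triggers={"high_pressure": lambda s: s["p"] > 1.3},
)
final, iters, fired = driver({"p": 0.0, "u": 0.0})
```

The toy coupling has the fixed point p = 4/3, u = 2/3, which the driver reaches well before the iteration limit; the trigger list records the sweeps in which the pressure condition held.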

A well-known guiding principle from Software Engineering is to auto- 
mate those programming tasks that can be best done by the computer, 
while giving human programmers the tools to carry out tasks best done 
by them. The CDs and the OD for a simulation system would give the 
programmer a human-friendly way of describing the complex codes to 
be combined, and how to combine them. 

The framework would employ compiler technology to analyze each 
component code, the CDs and the OD. It would then carry out the 
tedious transformation to automatically generate the combined code. 

4. PRINCIPLES OF APPLICATION 
INTEGRATION 

To guide the implementation of our model of application integration, 
we state a set of general principles. A framework which adheres to 
these principles should be able to couple multiple stand-alone application 
codes successfully. 

Cooperative Interoperability Principle: “Different stand-alone
application modules should coexist as a part of a single simulation.” The
codes should execute concurrently and exchange the results of their com- 
putations without human intervention. The framework would not im- 
pose its own restrictions on the parallelization or memory usage of the 
codes. It should accept whatever code optimizations are used by each 
code.

Minimal Source Change Principle: “The changes required of a
stand-alone simulation code for use within a framework should be
minimal.” This principle addresses the software engineering issues of using
an integration framework. The idea is that there should be only a single
version of a code. That code is then converted automatically for use in
the framework. Development may then continue on the single version, 
which is converted for use in the framework whenever necessary. This 
lessens the error-prone work needed for converting the code by hand, al- 
lows the programmer to maintain a single version of a code, and allows 
the code author to retain ownership of the code. 

Correct Data Exchange Principle: “The framework should exchange
data between codes in a mesh-, numerics-, and physics-aware fashion.”
This principle addresses the ease of converting data between codes.
The framework should contain facilities for mesh-cell-matching and data 
conversion, relieving the programmer from the effort of programming 
these things. The framework should have a set of physically-correct, 
numerically stable conversion techniques already deployed. 

Dynamic Adaptation Principle: “The framework should dynamically
respond to exceptional situations within the simulation.” This principle 
addresses the ability of the simulation system to adjust the operation of 
the overall simulation in a very fine-grained way, in response to dynamic 
situations. If the conditions within the simulation go beyond the bounds 
of a given code in one part of the mesh, a different code may have to be 
started to take over the simulation from that code, for that part of the 
mesh. 



5. PROOF OF CONCEPT WITH A LOAD 
BALANCING FRAMEWORK 

The class of frameworks described thus far in this paper is broad. 
Many frameworks could be built, based on the principles outlined in 
Section 4. In our own project, we have proceeded in measured steps 
toward the general goal of an integration framework for simulation codes. 
We are building on the experience of integrating our rocket simulation 
code by hand. The Rocket Code developers (Prasad Alavilli, Dennis 
Parsons, Ali Namazifard, and Jim Jiao) manually integrated the two 
simulation components (ROCFLO and ROCSOLID) with an interface 
code (ROCFACE). 

However, the task of building a general multi-simulation code integra- 
tion framework from scratch is a daunting one, so we have decided to do 
a proof-of-concept with a less ambitious task - implementing a
framework for doing load-balancing for our hand-integrated Rocket Code. The
basis of the load-balancing system is the Charm++ system, developed
by Kale et al.

Toward that end, we first chose to implement the hand-integrated
rocket simulation code on top of the Charm++ system. We chose to use
Charm++ as a substrate for our integration framework because of its
support for automatic interleaving of multiple components, and because
of its dynamic load balancing abilities. We developed a methodology
for converting this code to the Charm++ system that we believe is
automatable, so that a compiler-based tool could be built for
automatically converting a code for use with Charm++. This automation of the
conversion process will satisfy the Minimal Source Change Principle.

In the following sections, we will briefly describe the Charm++
system, and how we chose to use it with our Rocket Code.

5.1. THE CHARM++ SYSTEM

Charm++ is an explicitly parallel object-oriented system. Charm++
programs are typically written using C++, and use a small interface
description language, along with the Charm++ runtime support system
(RTS).

The basic entity in Charm++ programs is a data-driven object. A
computation comprises many such objects (or indexed collections of
such objects), which are mapped to processors under the control of
the Charm++ RTS. Such objects have a global, system-wide ID, and
can communicate with each other via asynchronous method invocations 
using these IDs. As the IDs remain the same, even when the RTS mi- 
grates objects from processor to processor, the application-writer can 
write their code without concerning themselves with load balancing (i.e. 
they write the code for one object to communicate with the other with- 
out concerning themselves with where these objects live). 

The core of the Charm++ RTS consists of a message-driven
scheduler. Messages in Charm++ represent computations to be performed
(methods to be invoked on data-driven objects). The scheduler
repeatedly chooses messages from a processor-wide pool and executes methods
denoted by them. Thus, messages directed at different objects (possibly
belonging to different modules) are interleaved, allowing concurrency
across different components on the same processor.
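
The message-driven execution described above can be illustrated with a small Python sketch. It is not the Charm++ API: the Scheduler and Counter classes, the object IDs, and the first-in-first-out choice of the next message are assumptions made only to show how messages aimed at different objects become interleaved on one processor.

```python
from collections import deque


class Scheduler:
    """Toy per-processor scheduler with a pool of pending messages."""

    def __init__(self):
        self.objects = {}      # global object ID -> object
        self.pool = deque()    # pending (object ID, method name, argument)
        self.trace = []        # order in which methods were executed

    def register(self, obj_id, obj):
        self.objects[obj_id] = obj

    def send(self, obj_id, method, arg):
        # Asynchronous invocation: enqueue the message and return at once.
        self.pool.append((obj_id, method, arg))

    def run(self):
        # Repeatedly pick a message and execute the method it denotes.
        while self.pool:
            obj_id, method, arg = self.pool.popleft()
            self.trace.append((obj_id, method))
            getattr(self.objects[obj_id], method)(arg)


class Counter:
    """Data-driven object that forwards a smaller value to its peer."""

    def __init__(self, sched, peer_id):
        self.sched = sched
        self.peer_id = peer_id
        self.total = 0

    def add(self, n):
        self.total += n
        if n > 1:
            self.sched.send(self.peer_id, "add", n - 1)


sched = Scheduler()
fluid_obj = Counter(sched, peer_id="solid[0]")
solid_obj = Counter(sched, peer_id="fluid[0]")
sched.register("fluid[0]", fluid_obj)
sched.register("solid[0]", solid_obj)
sched.send("fluid[0]", "add", 3)
sched.run()
```

The objects address each other only by ID, so nothing in their code depends on which processor hosts them.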

The data-driven objects provide a natural “grain” of execution to
be monitored for possible load imbalance. The Charm++ RTS
incorporates a load balancing support module, which keeps track of
execution times for each object, and communication patterns among objects.
These statistics are then provided to a “plug-in” load balancing strategy
module that decides whether and how to remap these objects among
processors, to get better load balance.
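
One simple strategy that such a plug-in module could apply is a greedy remapping based on the measured execution times. The sketch below is an assumption for illustration, not the strategy shipped with Charm++; it ignores communication patterns, and the function names and measured loads are invented.

```python
import heapq


def greedy_strategy(object_loads, n_procs):
    """Place the heaviest objects first, each on the least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]   # (total load, processor)
    heapq.heapify(heap)
    mapping = {}
    ordered = sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for obj, load in ordered:
        total, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    return mapping


def processor_loads(object_loads, mapping, n_procs):
    """Sum the measured object loads assigned to each processor."""
    loads = [0.0] * n_procs
    for obj, proc in mapping.items():
        loads[proc] += object_loads[obj]
    return loads


# Hypothetical measured execution times (seconds) for six objects.
measured = {"c0": 4.0, "c1": 3.0, "c2": 3.0, "c3": 2.0, "c4": 2.0, "c5": 2.0}
mapping = greedy_strategy(measured, n_procs=2)
balanced = processor_loads(measured, mapping, n_procs=2)
```

For these six objects the two processors end up with equal total load.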

5.2. OUR APPROACH TO USING 
CHARM++ 

The component codes, ROCFLO and ROCSOLID, were both written 
in Fortran 90, using MPI to implement parallelism and message passing. 
MPI forces the user to identify processors with integers representing the 
processor numbers. With MPI, the number of MPI processes is equal to 
the number of processors. To connect the MPI code with Charm++, we 
chose to replace the MPI runtime library with a library implemented on 
top of Charm++. In this form, the integers in the source code no longer
represent processor numbers, but instead indicate chunk numbers. 

A chunk in this context refers to the combination of a thread of execu- 
tion and its data. In the context of an MPI program, a chunk is similar 
to an MPI process, but without the separation of address spaces that is 
normally present with MPI processes. 

By doing this, we decouple the application code from a specific number 
of processors, and decouple a specific chunk from a specific processor. 
Then, Charm++ is free to allocate more chunks than processors, and
move chunks around from processor to processor, if load-balancing is 
required. 
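
The decoupling of chunk numbers from processor numbers can be pictured as a table maintained by the runtime. The following Python sketch is hypothetical (the ChunkMap class and the round-robin initial placement are assumptions); it only shows that the integer used by the application stays the same when a chunk moves.

```python
class ChunkMap:
    """Table from chunk number to processor, owned by the runtime."""

    def __init__(self, n_chunks: int, n_procs: int) -> None:
        self.n_procs = n_procs
        # Initial round-robin placement; any placement would do.
        self.location = {c: c % n_procs for c in range(n_chunks)}

    def processor_of(self, chunk: int) -> int:
        return self.location[chunk]

    def migrate(self, chunk: int, new_proc: int) -> None:
        if not 0 <= new_proc < self.n_procs:
            raise ValueError("no such processor")
        self.location[chunk] = new_proc

    def chunks_on(self, proc: int):
        return sorted(c for c, p in self.location.items() if p == proc)


chunk_map = ChunkMap(n_chunks=8, n_procs=2)   # more chunks than processors
before = chunk_map.chunks_on(0)
chunk_map.migrate(6, 1)                       # the code still says "chunk 6"
after = chunk_map.chunks_on(1)
```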



5.3. LOAD BALANCING METHODOLOGY 

The approach that we are exploring for the load balancing framework 
involves multi-partition decompositions. Computations in each individ- 
ual module are partitioned into a large number of chunks, such that 
there are many more chunks than processors. The code and data for 
simulating each chunk is encapsulated within a data-driven object. The 
program is written in such a way that the objects send messages to other 
objects, rather than sending messages to processors. As processors are 
not part of the programmer’s ontology, the system is free to move or 
migrate objects among processors, thus effecting load balancing when 
needed. 

As multiple chunks, possibly belonging to different modules (or ap- 
plications) are mapped to each processor, their execution must be inter- 
leaved by the runtime system. Data-driven interleaving, which depends 
on a scheduler to schedule computations of individual chunks, depending 
on the availability of their data (messages), accomplishes such
interleaving efficiently. As the chunks are migrated from processor to
processor, their messages must be correctly forwarded. Both of these
features are effectively supported by the Charm++ system.

For accomplishing load balancing, Charm++ incorporates a sophisti- 
cated load balancing subsystem. The particular strategy we use exploits 
the “principle of persistence”: the fact that in most scientific
computations, the computational loads of the chunks, and their
communication patterns, are highly correlated with their values in the immediate
past. This is true even for computations that require abrupt adaptive 
refinements, since such refinements are relatively infrequent. The load 
balancing framework carries out measurements of these characteristics, 
and then balances the load when needed, using a suite of strategies that 
are useful in different circumstances. 
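
Under the principle of persistence, the most recent measurements serve as the prediction for the next interval. A minimal sketch of a rebalancing decision built on that idea follows; the imbalance measure (maximum load divided by mean load), the threshold of 1.1, and the function names are assumptions chosen for the example.

```python
def imbalance(loads):
    """Ratio of the maximum processor load to the mean processor load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


def should_rebalance(history, threshold=1.1):
    """Predict the next interval from the latest measurements and test it."""
    predicted = history[-1]   # persistence: the recent past predicts the future
    return imbalance(predicted) > threshold


# Hypothetical per-processor loads measured in two successive intervals.
steady = [[5.0, 5.0, 5.0, 5.0]]
after_refinement = [[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 9.0, 5.0]]
```

With these numbers the steady history gives a ratio of 1.0 and needs no action, whereas the history after a local refinement gives 9/6 = 1.5 and would call for remapping.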

5.4. AUTOMATIC CODE CONVERSION TO 
THE CHARM++ FRAMEWORK

One of the challenges we faced was reconciling the need for Charm++
with our desire to make minimal changes to existing application codes.
This challenge was overcome with the development of an additional
run-time library. The Adaptive MPI (AMPI) library was built on top of
Charm++ to provide a complete implementation of the MPI library
routines. The MPI calls in the original code are intercepted by this 
library. With these techniques, it became possible to port the existing 
MPI codes, written using Fortran 90, to our run-time framework. 

A few other changes were still needed within the application codes.
When the MPI processes of the application codes are converted to
chunks, they lose the address space separation of the original codes.
This means that all globally-visible data of the original codes for the
chunks executing on a single processor would be placed at the same
memory locations, and the chunks would interfere with each other during
execution. So, such references to global data had to be eliminated from
the application codes. This can be done by dynamically allocating the
data at run-time, or else by statically allocating expanded versions of
the global variables and indexing them by the chunk number.
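
The two ways of eliminating shared global data can be shown side by side in a small Python analogy. The original codes are Fortran 90; the variable step_count, the class ChunkGlobals, and the helper functions below are invented for illustration.

```python
# Original pattern (not shown running): one module-level variable, e.g.
#     step_count = 0
# which every chunk in the same address space would overwrite.

# Option 1: expand the global and index it by chunk number.
N_CHUNKS = 4
step_count = [0] * N_CHUNKS


def advance_expanded(chunk: int) -> None:
    step_count[chunk] += 1


# Option 2: gather the globals into one record, allocated per chunk
# at initialization and reached through a reference.
class ChunkGlobals:
    def __init__(self) -> None:
        self.step_count = 0


def init_chunk() -> ChunkGlobals:
    return ChunkGlobals()


def advance_private(g: ChunkGlobals) -> None:
    g.step_count += 1


advance_expanded(0)
advance_expanded(0)
advance_expanded(2)

g0, g1 = init_chunk(), init_chunk()
advance_private(g0)
```

In both variants a chunk only ever touches its own copy of the former global.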

Table 1 Comparison of MPI and AMPI versions of ROCFLO & ROCSOLID. All
times are in seconds. Note that this is a scaled problem.

Processors    ROCFLO                ROCSOLID
              MPI        AMPI       MPI        AMPI
1             9.0192     8.8122     18.240     17.797
8             8.0796     8.0958     18.413     18.458
16            8.1908     8.2682     18.564     18.830
32            8.3415     8.3093     19.410     18.947
64            8.5535     8.6183     19.236     19.500
128           9.4889     9.6370     19.766     20.499

In addition, subroutines that pack and unpack the chunk’s private
data were coded by hand. However, this process is quite mechanical,
and could be completed easily once the principles were understood.
ROCFLO and ROCSOLID were converted with a few days of effort,
whereas ROCFACE (which was written in C++ with MPI) was
converted in 45 minutes. The observations made during this conversion
process, coupled with the compiler expertise in our team, led us to
realize that this conversion process can be fully automated with the help
of a compiler that can perform interprocedural analysis and
source-to-source transformations. Work on such an automatic conversion
program is in progress.
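
Pack and unpack subroutines of the kind mentioned above are mechanical: they flatten a chunk's private data into a buffer and rebuild it on the destination processor. The Python sketch below assumes a chunk consisting of an integer ID, a time step, and a list of double-precision values; this layout and the function names are hypothetical.

```python
import struct

# Header: chunk ID (int32), number of values (int32), time step (float64).
HEADER = "<iid"


def pack_chunk(chunk_id, time_step, values):
    """Flatten the chunk's private data into a byte buffer."""
    header = struct.pack(HEADER, chunk_id, len(values), time_step)
    body = struct.pack(f"<{len(values)}d", *values)
    return header + body


def unpack_chunk(buf):
    """Rebuild the chunk's private data from a byte buffer."""
    chunk_id, n, time_step = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    values = list(struct.unpack_from(f"<{n}d", buf, offset))
    return chunk_id, time_step, values


buffer = pack_chunk(3, 0.5, [1.0, 2.5, -4.0])
restored = unpack_chunk(buffer)
```

Because each routine just walks the chunk's data in a fixed order, generating such code automatically from the declarations seems plausible.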

With these conversions, the rocket simulation programs are now ready 
for adaptive automatic load balancing. We expect to test these abilities 
in the near future, when the application incorporates such features as 
adaptive mesh refinements, and also when running on dynamic environ- 
ments such as workstation clusters. 

6. EXPERIMENTAL RESULTS 

Original ROCSOLID and ROCFLO performance results were compared
with their implementations on our framework. Refer to Table 1
for the performance results. Experiments were performed at the National
Center for Supercomputing Applications (NCSA) on an Origin2000
machine (250 MHz R10000 processors). A version of Charm++ that uses
native MPI as a communication layer was deliberately chosen in order
to measure the overhead that AMPI incurs over MPI. Charm++ can
also be made to use shared memory arenas for communication, resulting
in better performance.

In the process of conversion to our framework, the global data items 
used by both codes were “privatized” with respect to threads, so that 
multiple threads could co-exist on the same processor. Timings on one 
processor point to the effects of this privatization. In ROCFLO, this pri- 
vatization was done by extending the dimensions of global data items, 
where the thread number was used as an index to access thread-private 
data. In ROCSOLID, we encapsulated the global data items in a single
user-defined type, which is dynamically allocated by every thread at
initialization. Each data item was then accessed with an indirect
reference. Surprisingly, this sped up execution of both ROCFLO and
ROCSOLID on one processor. We suspect this effect was due to a
coincidental rearrangement of data items, reducing cache misses. These
are still preliminary results, and more thorough experiments are being
performed.

It should be noted that the communication overhead due to the
additional communication layer of Charm++ and AMPI is eclipsed by
better cache behavior, and is less than 4% even on larger numbers
of processors. We suspect that the overhead on more processors arises
because collective operations in AMPI are not tuned for larger numbers
of processors.
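
The overhead figure can be checked directly from the timings in Table 1. The short Python computation below uses the table values as printed; the definition of overhead as (AMPI - MPI) / MPI is our reading of the text rather than a formula stated in it.

```python
# processors: (ROCFLO MPI, ROCFLO AMPI, ROCSOLID MPI, ROCSOLID AMPI), seconds
TABLE_1 = {
    1: (9.0192, 8.8122, 18.240, 17.797),
    8: (8.0796, 8.0958, 18.413, 18.458),
    16: (8.1908, 8.2682, 18.564, 18.830),
    32: (8.3415, 8.3093, 19.410, 18.947),
    64: (8.5535, 8.6183, 19.236, 19.500),
    128: (9.4889, 9.6370, 19.766, 20.499),
}


def overhead_percent(mpi: float, ampi: float) -> float:
    """Relative extra time of the AMPI version, in percent of the MPI time."""
    return 100.0 * (ampi - mpi) / mpi


overheads = {
    procs: (overhead_percent(row[0], row[1]), overhead_percent(row[2], row[3]))
    for procs, row in TABLE_1.items()
}
worst = max(max(pair) for pair in overheads.values())
```

The largest value is about 3.7% (ROCSOLID on 128 processors), and on one processor both entries are negative, that is, the AMPI versions ran slightly faster.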

7. SUMMARY 

Existing code integration frameworks have not attracted a large
number of users. We believe that this is due primarily to Software
Engineering issues, such as the need to rewrite a code to use a framework,
the need to maintain multiple copies of a code, and the error-prone nature
of recoding a working program.

We have proposed a set of code integration principles that we believe 
will make integration frameworks more widely accepted by the applica- 
tions community, because frameworks adhering to these principles would 
support the plug-and-play use of a code within any of them. 

A preliminary experiment has targeted codes for a Charm++ load
balancing framework and we have found that automatic translation of
codes for that framework is indeed feasible.

DISCUSSION

Speaker: Milind A. Bhandarkar 

Morven Gentleman: How much of your approach depends on the
open context in which ROCSOLID and ROCFLO were developed? Would
it be possible in a context where, perhaps for proprietary reasons, the
code, designs, data structures, algorithms, etc. were not so exposed?

Eric de Sturler: [This reply was provided after the conference. - Ed.]
First of all, the prime intention of this framework is to help
collaborating researchers to combine their codes for multiphysics
simulations with minimal effort and in a short time frame. To this end the
code offers functionality to exchange boundary data, to match disparate
meshes, to coordinate timesteps, and to compute physically correct
solutions of the joint physical processes taking place on the boundary.
The mathematics and physics needed for a correct time integration
generally need to be written for the specific applications, but can then
be implemented easily within the framework.

Clearly, this requires the framework to have access to several types 
of data from each application and the associated geometric information: 
field variables defined on the boundary, time step constraints, the geo- 
metric information that defines the mesh on the boundary, type of finite 
elements/volumes, shape functions, and potentially other data. The
way this has been done for currently integrated codes is that developers 
added a single (well-defined) module in which the data needed in other 
applications is extracted or computed and declared and in which the 
data needed from other applications is declared. This is the only data 
that needs to be visible to the framework. In order to (dynamically) 
start a partition of an application with its own data (chunk), to migrate 
it to another processor (load balancing), or to remove it, the framework 
must be able to interact with the application (partition). Some handle 
must be passed to the framework to allow it to signal a chunk that it 
will be moved. The chunk must then be able to export its data to the 
framework, after which it will be removed. After migrating to another 
processor the chunk will be restarted with its data. 

Following encapsulation and separation of concerns principles the 
framework is designed such that the handles mentioned above should be 
the only features required to implement the functionality of the frame- 
work. The importance of such a design is that new codes can be added 
to the framework without undue changes in other codes and without 
significant interaction with developers of other codes. In general some 
interaction will be needed if new/different data is required for
interaction with the new code. For example, new coupling algorithms may
need to be added.

If one wants to integrate a code that is not available as source code, 
some wrappers will be needed. For example, many finite element pack- 
ages are available as a library of object modules that can be called in 
the user’s program. The internal data structures, algorithms and so on 
are typically unknown. Only the (public) interface is described. In this 
case we would write a small subroutine that has the required handles,
initializes the data for the subroutine(s) that will compute the solution
to the sub-problem of interest, calls the library routine(s), extracts the
required data from the results, and provides those in a form that is
usable and visible to the framework.

One could think of strategies to use simulation codes that are available 
only as a single monolithic executable, but we think this would not be 
useful.

Abstract This paper describes an approach for scientific code coupling using
CORBA objects. Our approach is based on an extension of CORBA,
called PaCO (Parallel CORBA Object), to efficiently support the
encapsulation of parallel codes into distributed objects. With such an
extension, a parallel code can be seen as a collection of identical CORBA
objects. Our extension to CORBA modifies only the Interface Definition
Language (IDL) syntax by adding new language constructs. These
new keywords allow the specification of several aspects associated with
a collection of objects. We developed a new IDL compiler that generates
stubs and skeletons to manage collections of objects transparently
to the users. Parallel CORBA objects have been used within an industrial
application from Aerospatiale Matra in the field of electromagnetic
simulation. The paper gives some performance results.

Keywords: CORBA, parallel CORBA object, coupled simulation, distributed
numerical simulation.



1. INTRODUCTION 

Scientific computing has become indispensable to the design of complex
systems. By allowing virtual experiments, numerical simulation can speed
up the design phase and decrease its associated cost. It is thus an
excellent approach to increase the competitiveness of industry. For
many years, the aerospace industry has made intensive use of numerical
simulation. However, it is now facing an important challenge. As
physical systems become more and more complex, their numerical
simulation requires the combination of several simulation codes which are
either developed in-house or purchased from software vendors. Each of
these codes simulates a particular physical behavior (computational fluid
dynamics, structural analysis, electromagnetics, ...). To reduce the cost
of developing such simulation applications (also called coupled
applications), aerospace companies try as much as possible to reuse, or to
adapt, existing codes they developed in previous projects. Therefore,
designing a simulation application for a new industrial project is often
seen as an integration of existing codes. Such integration consists in
developing a framework that is in charge of calling the different
simulation codes in a specific order and letting them exchange their
simulation results. An object-oriented approach is thus well suited for the
development of such a framework, requiring that each of these simulation
codes be an object.

Moreover, the execution of such a numerical simulation application has
to be performed in a reasonable time frame. However, since the
application is made of several codes, simulation time increases. To keep
it within a reasonable time frame, it is necessary to be able to exploit
several computing resources which are available at either an intranet or
the Internet level. It is therefore mandatory to develop a coupled
application in such a way that each code can be run on a distinct
computing resource. Moreover, the design of new complex systems requires
the involvement of several industrial participants having their
simulation tools running on their own computing resources. For
confidentiality reasons, industrial actors are often reluctant to share
their data or their simulation tools. For these two reasons
(confidentiality and availability of computing resources), it is thus
necessary to perform the simulation in a distributed manner. Again, an
object-oriented approach is able to fulfill these requirements. Indeed,
the use of distributed objects allows transparent remote execution,
permitting the exploitation of geographically dispersed computing
resources. However, distributed object technologies have to be adapted
to the context of high-performance computing, taking into consideration
that a distributed object has to encapsulate a parallel code.

In this paper, we relate our experience in coupling two
instances of an electromagnetic simulation code to perform a distributed
numerical simulation. The coupling is based on an extension of CORBA
(Common Object Request Broker Architecture), called PaCO (Parallel
CORBA Object). The remainder of this paper is structured as follows.
Section 2 discusses some issues that arise when coupling codes together.
Section 3 gives a short overview of CORBA. Section 4 provides a
description of the concept of parallel CORBA object. Section 5 describes
the electromagnetic simulation application and some performance results.
Finally, we conclude in Section 6.

2. CODE COUPLING 

The coupling of codes requires an underlying communication mechanism
to allow the transfer of both data and control between the codes.
The transfer of data corresponds to the sending of the results calculated
by one of the codes and their reception by another code, whereas the
transfer of control is the execution of one particular function to be
performed remotely in another code. Most attempts to couple codes together
rely on the use of message-passing libraries such as MPI. However, MPI
was mainly designed for parallel programming and not for distributed
programming. Current MPI implementations are not interoperable, so
it is not possible to exchange messages between different computing
resources due to their heterogeneity. Recent research has led to
extensions of existing message-passing libraries, such as MPI, that are
able to exchange data between heterogeneous computing resources. MPICH-G
[2], PACX [1], PLUS [9] and MPI_Connect [5] are examples of such
extensions. However, these extensions have some drawbacks. For instance,
PLUS and PACX do not allow several processes of a parallel code to
transfer data simultaneously (and thus efficiently) to other processes of
a parallel code running on a remote machine. One node of each machine
acts as a bridge to let processes of the two parallel codes communicate
with each other. Such a design does not offer a scalable way to
communicate between parallel machines. It thus represents a potential
bottleneck when communicating between several parallel machines whose
compute nodes are connected to an Ethernet network (such as the IBM
SP-2, NEC Cenju-4, and clusters of PCs or workstations). If two of
these machines are connected using a multi-gigabit network, it will not
be possible to exploit the whole capacity of this network since the
communication rate will be bounded by the performance of the network that
connects one compute node to the multi-gigabit network.

Moreover, even if an implementation of MPI allows multiple flows of
data to be sent simultaneously, we think that message-passing is not
suitable for connecting several parallel codes together. Indeed,
message-passing was mainly designed for parallel programming and not for
distributed programming. This means that it is mainly used to transfer
data but not control. For instance, if one code would like to call a
particular function in another code, the latter has to be modified in
such a way that a message type is associated with this particular
function. Such a modification often requires a deep understanding of the
code. Moreover, entry points in a code are not really exposed to
potential users who would like to include such code in their
applications. When control has to be transferred between codes running on
different machines, communication paradigms such as RPC or distributed
objects offer a much more attractive solution, since the transfer of
control is implemented by remote invocation that is as simple as calling
a function or a method. We advocate an approach, like others [4], that
consists in merging several communication paradigms in a coherent way.
This approach is based on the use of two communication paradigms:
distributed objects (CORBA) and message-passing (MPI).

Figure 1 Example of an IDL interface

Figure 2 CORBA Architecture

3. A SHORT OVERVIEW OF CORBA 

CORBA is a specification from the OMG (Object Management 
Group) [6] to support distributed object-oriented applications that are 
based on a client/server approach. An application based on CORBA 
can be seen as a set of independent software entities or CORBA objects. 
Each server object is associated with an interface that describes which 
operations can be remotely invoked by a client object. An object interface
is specified using the Interface Definition Language (IDL) as shown in
the example given in Figure 1. In this example, the interface contains
two operations (mult and skal), each of them having some parameters
whose types are similar to C++ ones. A keyword added just before
the type specifies whether the parameter is an input parameter, an
output parameter, or both. IDL provides interface inheritance so that
services can be extended easily.

Figure 2 provides a simplified view of the CORBA architecture. When 
a client invokes an operation of a remote object, communication between 
the client and the server is performed through the Object Request Broker 
(ORB). The ORB offers a communication infrastructure independent of 
the underlying platform (machine and operating system). The client is
connected to the ORB through a stub, whereas the server is connected
through a skeleton. Stubs and skeletons are generated automatically by an
IDL compiler taking as input the IDL specification of the object. Since
CORBA hides the language used for the object implementation, an IDL
compiler may generate stubs and skeletons for different languages (C++,
Java, Smalltalk, ...). An object can thus be implemented in C++ and
called by a client implemented in Java. A stub acts simply as a proxy
object that behaves the same as the object implementation on the
server side. Its role is to deliver requests to the server. Similarly, the
skeleton is an object that accepts requests from the ORB and delivers
them to the object implementation.
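
The roles of stub, ORB, and skeleton can be imitated in a few lines of Python. This is an in-process toy, not CORBA: real ORBs use their own wire protocol and generated code, whereas this sketch uses JSON and hand-written classes, and the names ToyORB, VectorServant, and scale are assumptions made for the example.

```python
import json


class VectorServant:
    """Object implementation on the server side."""

    def scale(self, factor, values):
        return [factor * v for v in values]


class Skeleton:
    """Accepts requests from the ORB and delivers them to the servant."""

    def __init__(self, servant):
        self.servant = servant

    def dispatch(self, request_bytes):
        request = json.loads(request_bytes)
        result = getattr(self.servant, request["op"])(*request["args"])
        return json.dumps({"result": result}).encode()


class ToyORB:
    """Routes a marshalled request to the skeleton of the target object."""

    def __init__(self):
        self.skeletons = {}

    def register(self, object_id, skeleton):
        self.skeletons[object_id] = skeleton

    def invoke(self, object_id, request_bytes):
        return self.skeletons[object_id].dispatch(request_bytes)


class Stub:
    """Client-side proxy offering the same operation as the servant."""

    def __init__(self, orb, object_id):
        self.orb = orb
        self.object_id = object_id

    def scale(self, factor, values):
        request = json.dumps({"op": "scale", "args": [factor, values]}).encode()
        reply = self.orb.invoke(self.object_id, request)
        return json.loads(reply)["result"]


orb = ToyORB()
orb.register("vector-1", Skeleton(VectorServant()))
client_view = Stub(orb, "vector-1")
scaled = client_view.scale(2.0, [1.0, 2.5])
```

From the client's point of view the call on the stub looks like a local method call; marshalling and routing stay hidden.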

4. PARALLEL CORBA OBJECT 

The adoption of CORBA for coupled simulation raises some difficulties,
mainly when dealing with the coupling of parallel codes. Indeed,
the encapsulation of parallel codes into CORBA objects has been seen 
as the major obstacle for the use of such a technology. Therefore, the 
integration of MPI-based parallel codes to existing CORBA infrastruc- 
tures is an important issue in many research and development projects. 
MPI-based parallel codes are mostly based on an SPMD execution model. 
With such a model, a parallel code is a set of identical processes running 
concurrently and exchanging data through the sending and the receiving 
of messages. The usual way of encapsulating such codes into CORBA 
objects is to adopt a master/slave approach as shown in figure 3. In that 
case, only one process, called the master, is encapsulated into a CORBA 
object. Other processes (the slaves) are connected to the master process 
through the MPI layer. The master may represent an important bottle- 
neck when two MPI codes, encapsulated into CORBA objects, have to 
communicate with each other. Indeed, the master process has to gather 
data from the slave processes, using MPI, and has to send them to the 
callee through the ORB. The callee will then call the other CORBA 
object that in turn will scatter the data to its slave processes. This 
approach does not offer a scalable solution to the encapsulation of par- 
allel codes. As the number of slave processes or the size of the problem 
(amount of data transmitted between two parallel codes) increases, it 
will entail a large overhead. To avoid such undesirable behavior, we ad- 
vocate the use of a new kind of CORBA objects we call parallel CORBA 
objects (PaCO). 
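
A rough way to see why the master becomes a bottleneck is to count how much data passes through the busiest sending process in each scheme. The sketch below is a back-of-the-envelope model, not a measurement from this work: it assumes the data are split evenly over the processes and ignores latency and redistribution costs. The function names are hypothetical.

```python
# Back-of-the-envelope model (an assumption of this sketch, not a result from
# the paper): data handled by the busiest process on the sending side.

def master_slave_load(total_bytes, n_processes):
    """Master gathers every block, then sends the whole data set via the ORB."""
    own_block = total_bytes / n_processes
    gathered = total_bytes - own_block      # received from the slaves over MPI
    sent_via_orb = total_bytes              # forwarded through the ORB
    return gathered + sent_via_orb


def parallel_object_load(total_bytes, n_processes):
    """Each object of the collection only sends its own block."""
    return total_bytes / n_processes


loads = {
    n: (master_slave_load(8_000_000, n), parallel_object_load(8_000_000, n))
    for n in (2, 4, 8)
}
```

Under these assumptions the master's load tends towards twice the total data volume as processes are added, while the per-object load of a parallel object shrinks.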

Figure 3 Master/slave approach 

Figure 4 Parallel CORBA object 

A parallel CORBA object is a collection of identical standard CORBA 
objects as shown in figure 4. Each CORBA object encapsulates an SPMD 
process of the parallel code. As our goal is to hide parallelism from the 
user, all the objects belonging to a collection are manipulated as a single 
entity. In this way, a parallel CORBA object is seen as a standard 
object from the client’s point of view. Therefore, when a client invokes 
a remote operation on a parallel CORBA object, the associated method 
is executed concurrently by all objects belonging to the collection. Such 
parallel execution is performed under the control of the stub associated 
with the parallel CORBA object. Since the stub behaves differently 
from the one associated with a standard CORBA object, we modified 
the way an IDL compiler generates stubs. Such modifications were made 
possible by enriching the IDL language with new constructs. This new 
IDL language is called Extended-IDL. Extensions allow users to specify a 
collection of objects and to add data distribution attributes to operation 
parameters. The following paragraphs describe the Extended-IDL using 
the example shown in figure 5. All the extensions added to the IDL, 
as well as some restrictions (for instance on interface inheritance), are 
presented in more detail in [7]. 

interface [*] MatrixOperations { 
    const long SIZE = 100; 
    typedef double Vector[ SIZE ]; 
    typedef double Matrix[ SIZE ][ SIZE ]; 

    void mult( in dist[ BLOCK ][ * ] Matrix A, 
               in Vector B, 
               out dist[ BLOCK ] Vector C ); 
    csum double skal( in dist[ BLOCK ] Vector A ); 
}; 

Figure 5 Example of an extended-IDL interface 

Mapping of objects. The number of objects in the collection that 
will implement the parallel object is specified within the two brackets 
after the IDL keyword interface. There are several ways to fix the 
number of objects in the collection. The expression may be an integer 
value, an interval of integer values, a function or the * symbol. This 
latter option means that the number of objects is chosen at runtime 
depending on the available resources (i.e. the number of computing 
nodes if we assume that each object is assigned to only one node). 

Data distribution. Data distribution is specified using the dist 
keyword before the type of each parameter to be distributed. In the 
previous example, operation mult has two parameters (matrix A and 
vector C) which are distributed. After the dist keyword, distribution 
mode for each array dimension is specified. Distribution modes are similar 
to the ones defined in HPF (High Performance Fortran). The * symbol 
indicates that the corresponding array dimension is not distributed. A 
non-distributed parameter, such as vector B, is replicated on every object 
of the collection. 
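
The effect of a BLOCK distribution can be sketched as follows. This is an illustration only: the function names are hypothetical, the block size rule ceil(n / p) is the usual HPF convention rather than something stated here, and mult is assumed to compute the matrix-vector product C = A B.

```python
# Sketch of an HPF-style BLOCK distribution of the first array dimension.
# Function names are hypothetical; block size ceil(n / p) is the usual HPF rule.

def block_ranges(n, p):
    """Return p half-open (start, stop) index ranges covering range(n)."""
    size = -(-n // p)  # ceiling division
    return [(min(i * size, n), min((i + 1) * size, n)) for i in range(p)]


def distributed_mult(a, b, p):
    """Compute C = A * B with rows of A and entries of C block-distributed.

    Each of the p 'objects' receives a block of rows of A and the whole
    (replicated) vector B, and produces the matching block of C.
    """
    c_blocks = []
    for start, stop in block_ranges(len(a), p):
        local_rows = a[start:stop]                      # dist[BLOCK][*] Matrix A
        local_c = [sum(x * y for x, y in zip(row, b))   # B is replicated
                   for row in local_rows]
        c_blocks.append(local_c)                        # dist[BLOCK] Vector C
    return c_blocks


A = [[1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]]
B = [1, 2, 3]
blocks = distributed_mult(A, B, p=3)
```

With this rule the last objects of the collection may receive a shorter block, or none at all, when the array size is not a multiple of the number of objects.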

Collective operation. A collective operation is a simple way to 
perform computations on the values returned by the objects belonging 
to the collection. Collective operations are performed by the stub at 
the client side. Collective operations are allowed only on scalar types. 
Operation skal in the previous example illustrates the use of this new 
extension. In this example, the csum keyword indicates that the value 
returned by the operation is the sum of all the values given by all objects 
belonging to the collection. 
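
A client-side stub applying csum might behave roughly as in the sketch below. The classes are hypothetical, threads stand in for concurrent remote invocations, and the local behaviour given to skal (sum of squares of the local block) is a placeholder, since the text does not say what skal computes.

```python
# Sketch of a client-side stub applying a 'csum' collective operation.
# The local behaviour of skal (sum of squares) is a placeholder assumption.
from concurrent.futures import ThreadPoolExecutor


class CollectionMember:
    """One object of the collection (stateless in this toy version)."""

    def skal(self, local_block):
        return sum(x * x for x in local_block)


class ParallelStub:
    """Invokes skal on every member and sums the returned scalars (csum)."""

    def __init__(self, members):
        self._members = members

    def skal(self, vector):
        p = len(self._members)
        size = -(-len(vector) // p)  # BLOCK distribution, ceiling division
        blocks = [vector[i * size:(i + 1) * size] for i in range(p)]
        with ThreadPoolExecutor(max_workers=p) as pool:
            partial = list(pool.map(lambda mb: mb[0].skal(mb[1]),
                                    zip(self._members, blocks)))
        return sum(partial)


stub = ParallelStub([CollectionMember() for _ in range(3)])
total = stub.skal([1.0, 2.0, 3.0, 4.0, 5.0])
```

The client sees a single scalar result, exactly as it would with a standard object.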

4.1. STUB AND SKELETON CODE 
GENERATION 

A stub generated by the Extended-IDL compiler does more work than 
a standard stub. Indeed, it is in charge of invoking simultaneously the 
same operation on each object of the collection. It also has to handle 
data distribution. Moreover, when stubs are used within a parallel object, 
they are in charge of synchronizing invocations and redistributing 
data [3]. It is important to note that a parallel flow of data can be 
maintained between two parallel CORBA objects, allowing an efficient use of 
high-performance networks (gigabit network) that connect computing 
resources together. The skeleton generated by the Extended-IDL compiler 
handles distributed data. A more detailed description of the parallel 
CORBA object concept can be found in [10]. 

4.2. IMPLEMENTATION 

To implement the Extended-IDL compiler, we modified the IDL compiler 
provided by MICO [8], which is a freely available (see note 3) and fully 
compliant implementation of the CORBA 2.3 standard. In the current 
implementation, the code generated by the Extended-IDL compiler cannot be 
used in conjunction with other CORBA implementations because it 
uses MICO-specific methods. However, we are planning to modify the 
stub and the skeleton code generation process in the near future to make 
the generated code independent of the CORBA implementation. 
Modifications to the MICO compiler have been made in such a way that we 
are able to handle new versions of MICO with little effort. 

5. COMPUTATIONAL ELECTROMAGNETIC 
APPLICATION 

For the design of its products, Aerospatiale Matra has been developing 
electromagnetic simulation software for more than twenty years. Despite 
the fact that these codes are highly optimized, a parallel run computing 
the antenna interaction on a typical aircraft may last for more than 
24 hours on a high-end 16-PE IBM SP. This is partly due to the fact 
that the same exact method is used to study the complete object, though 
it would have been better (in terms of performance) to use an appropriate 
method on each part of the system whenever possible. Moreover, 
Aerospatiale Matra is now evolving in such a challenging world that a 
competitor on one project may be a sub-contractor on another one. For 
these different reasons, a coupling method has been developed so that 
each part of the system may be studied with the most suitable method. 
Moreover, each part can be considered as a black box and therefore can 
be simulated by its manufacturer without revealing any details to third 
parties. 

The chosen application aims at simulating the electromagnetic interaction 
between two different physical objects, such as an antenna 
and an aircraft. The objective was to develop a coupled application 
so that each object could be simulated using a distinct computing resource. 
The simulation of each physical object is carried out using the 
ASERIS/BE simulation code from Aerospatiale Matra. A second objective 
was to make very few modifications to this simulation code so 
that a single code can be used for both standard simulation and coupled 
simulation. Figure 6 shows a simple example of an electromagnetic simulation 
involving two physical objects. This coupling method is based 
on an iterative process where the currents on an object due to the other 
parts are successively computed until convergence is achieved. As the 
different geometries may remain private (for confidentiality reasons), a 
fictitious surface is placed in between the different objects. For a typical 
simulation, the following quantities are computed: the currents due to 
the two objects on the fictitious surface, the interaction of the surface 
on each of the two objects and the currents on each of the objects due 
to the external illumination and the surface. The stopping criterion is 
based on the difference in the computed currents between two successive 
iterations. 

Figure 6 Electromagnetic simulation 

Adding a new object (the fictitious surface) requires the computation 
of new quantities, as mentioned previously. Instead of modifying 
the ASERIS/BE simulation code, we decided to develop two 
new parallel codes to compute the currents on the fictitious surface (IoR 
code) and the interaction between the objects and the fictitious surface 
(PoR code). The different execution steps are illustrated in figure 7. 
Each physical object is simulated using a set of 5 codes: ASERIS/BE, 
IoR, VecSum, PoR and TestStop. The steps are executed sequentially 
for each object. However, the simulation of the two objects is 
carried out in parallel and is synchronized at each time step. At the 
beginning of the simulation, the electromagnetic radiation of one object 
is computed by the ASERIS/BE code (step 1). This radiation is then 
projected to the fictitious surface by the IoR code (step 2). The results 
of the two projections (one for each object) are combined by the VecSum 
code (step 3) in order to compute the interaction between the two 
objects. This is done by the PoR code (step 4). The electromagnetic 
radiation of the object is then recalculated by the ASERIS/BE code (step 
5). The last step (6) computes the convergence criterion to determine 
whether a new iteration is required to reach an equilibrium. 
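
The ordering of the six steps can be sketched as a fixed-point loop. The numerical functions below are scalar stand-ins with invented coefficients and hypothetical names; only the sequence of steps and the stopping test on successive iterates follow the description above.

```python
# Control-flow sketch of the coupling iteration. The 'physics' is a scalar toy
# model with made-up coefficients; only the step ordering follows the text.

def aseris_be(illumination, surface_feedback):
    """Steps 1 and 5: current on one object (toy linear response)."""
    return illumination + 0.3 * surface_feedback


def ior(current):
    """Step 2: project the radiation of one object onto the fictitious surface."""
    return 0.5 * current


def vec_sum(projection_1, projection_2):
    """Step 3: combine the two projections."""
    return projection_1 + projection_2


def por(surface_current):
    """Step 4: interaction of the fictitious surface on an object."""
    return 0.4 * surface_current


def has_converged(previous, current, tol):
    """Step 6 (TestStop): compare the currents of two successive iterations."""
    return abs(current - previous) < tol


def couple(illum_1, illum_2, tol=1e-10, max_iter=1000):
    cur = [aseris_be(illum_1, 0.0), aseris_be(illum_2, 0.0)]   # step 1
    for iteration in range(1, max_iter + 1):
        proj = [ior(c) for c in cur]                           # step 2
        surface = vec_sum(proj[0], proj[1])                    # step 3
        feedback = por(surface)                                # step 4
        new = [aseris_be(illum_1, feedback),
               aseris_be(illum_2, feedback)]                   # step 5
        done = all(has_converged(o, n, tol)                    # step 6
                   for o, n in zip(cur, new))
        cur = new
        if done:
            return cur, iteration
    raise RuntimeError("no convergence")


currents, n_iter = couple(1.0, 2.0)
```

In the real application each function call would be a remote invocation on an encapsulated code, and the two objects would advance concurrently rather than inside a single loop.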



[Figure 7: the six execution steps, carried out for Object 1 and Object 2: 
1: ASERIS/BE, 2: IoR, 3: VecSum, 4: PoR, 5: ASERIS/BE, 6: TestStop] 

Figure 7 Execution steps 



5.1. CODES COUPLING USING PARALLEL 
CORBA OBJECTS 

The first step to design a coupled simulation application is to encapsulate 
the numerical codes into either CORBA objects (for the sequential 
code) or parallel CORBA objects (for parallel codes). It requires 
the specification of the interface using the Extended-IDL language. We 
kept the interface as simple as possible since our constraint was to make 
few modifications to the original source codes provided by Aerospatiale 
Matra. The coupled application relies on the use of two CORBA services: 
the naming and the event services. The naming service has been 
modified slightly to support parallel CORBA objects. A symbolic name 
can be associated with a collection of object references instead of only one 
object reference. However, the binding to a parallel CORBA object remains 
the same as for a standard CORBA object [10]. The event service 
is used to detect the termination of the coupled application. 

Figure 8 gives an overview of the coupling strategy. AS-ELFIP, IoR, 
PoR, TestStop and VecSum are parallel codes encapsulated into parallel 
CORBA objects whereas CtrlLoop is a standard CORBA object. 
The simulation of the two physical objects can be carried out independently 
at each time step. Therefore, the software architecture is made 
of two kinds of schedulers. The master scheduler manages the overall 
application. It launches the two secondary schedulers by invoking an 
asynchronous (oneway) operation on each secondary scheduler (CORBA 
objects). Then, the master scheduler waits for events on an event channel. 
Receiving an event indicates the termination of the application. 
Those events are sent by the two secondary schedulers when the 
two simulation codes satisfy the convergence criterion. Secondary schedulers 
invoke sequentially all the encapsulated simulation codes. They synchronize 
the computation flows through the VecSum parallel CORBA 
object to exchange data values. The CtrlLoop CORBA object gathers 
convergence results (boolean value) from the TestStop parallel CORBA 
objects. It indicates whether the simulation process is completed by 
sending an event through an event channel. 
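
The interplay between the master scheduler, the secondary schedulers and the event channel can be mimicked with threads. In this sketch a queue stands in for the CORBA event channel, starting a thread stands in for a oneway invocation, a barrier plays the synchronising role of VecSum, and a fixed number of steps replaces the real convergence test. All names are illustrative assumptions.

```python
# Thread-based sketch of the master / secondary scheduler pattern. Threads and
# a queue stand in for CORBA oneway invocations and the event channel; all
# names are illustrative assumptions.
import queue
import threading


def secondary_scheduler(name, n_steps, barrier, event_channel):
    """Runs its set of codes step by step, then signals termination."""
    for _ in range(n_steps):
        # ... the encapsulated codes would be invoked sequentially here ...
        barrier.wait()           # synchronisation point (role of VecSum)
    event_channel.put((name, "converged"))


def master_scheduler(n_steps=3):
    event_channel = queue.Queue()
    barrier = threading.Barrier(2)
    workers = [
        threading.Thread(target=secondary_scheduler,
                         args=(name, n_steps, barrier, event_channel))
        for name in ("object-1", "object-2")
    ]
    for w in workers:            # 'oneway': start and return immediately
        w.start()
    # Block until each secondary scheduler has reported termination.
    events = [event_channel.get(timeout=10) for _ in workers]
    for w in workers:
        w.join()
    return sorted(events)


received = master_scheduler()
```

The master never polls the secondary schedulers; it simply blocks on the channel, which is the property the event service provides in the coupled application.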

Coupling simulation and visualization. One objective was to 
perform the simulation coupled with the visualization of the interme- 
diate results at each time step. We developed a simple Java applet to 
control the execution of the application and to perform the visualization 
as soon as the results are available. One benefit of such an approach is 
that engineers can use the coupled application from their own computing 
system, whatever that system is. Figure 9 shows the execution 
of the Java applet running within a Web browser. Visualization is performed 
by the COUCHA code, which takes as input the results 
from the ASERIS/BE code and produces a VRML file. The COUCHA 
code is a parallel code, so it has been encapsulated within a parallel 
CORBA object. The VRML file is read by the CosmoPlayer plug-in 
and displayed within the Web browser window. Communication between 
the Java applet and the coupled application is performed through the 
CORBA ORB. 

5.2. PERFORMANCE 

We made several experiments to assess the performance of the coupling 
technique. These experiments were performed on a cluster of PCs 
(450 MHz Pentium III processors) where each PC is connected to a Fast 
Ethernet network. Communication between the two sets of codes (left 
and right side in Figure 8) is performed using the CORBA ORB and 
an NFS file system within which files are stored. Since the two sets of 
codes ran in parallel, a speedup is expected for the coupled 
application. We observed a speedup of 1.52 when using four processors for 
each set of codes (a total of eight processors for the whole application) 
compared to the use of four processors to run the two sets of 
codes successively (at each time step, the first object is simulated, followed by the 
second one). 

Figure 9 Monitoring the execution of the coupled application through a Web browser 
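
To put the reported speedup in perspective, it can be compared with the best case for running two sets of codes side by side. The ideal value of 2 used below is an assumption of this sketch: it holds only if both sets of codes take the same time and coupling costs nothing, which the text does not state.

```python
# Arithmetic on the reported speedup. The 'ideal' value of 2 assumes that both
# sets of codes take the same time, which the text does not state.

def fraction_of_ideal(observed_speedup, ideal_speedup=2.0):
    """Share of the ideal speedup that was actually obtained."""
    return observed_speedup / ideal_speedup


ratio = fraction_of_ideal(1.52)
```

Under that assumption the coupled run reaches roughly three quarters of the ideal speedup.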

6. CONCLUSION 

In this paper, we give a description of our experience in the coupling 
of simulation codes using parallel CORBA objects. One main bene- 
fit of this technology is to allow a simple encapsulation of codes into 
distributed objects. Such encapsulation does not require extensive modifications 
to the original source codes. The main additional work was to 
design the master and secondary schedulers to control the execution of 
the distributed simulation application. However, these schedulers were 
based on existing CORBA services (naming and event services) and thus 
were not very time-consuming to develop. It is worth mentioning 
that the encapsulated simulation code can be used within another cou- 
pled simulation application without modifying the existing interface. 

Notes 

1. The one acting as a bridge for the communication between different machines 
2. Usually through a 100 Mb/s Ethernet link connected to a switch 

3. http://www.mico.org 
DISCUSSION 

Speaker: Christoph René 

Vladimir Getov : Have you considered using the Interoperable Mes- 
sage Passing Interface (IMPI) as an alternative environment for your 
project? 

Christoph René : We are considering a distributed system where some 
of the nodes are parallel systems. Communication requirements are not 
the same when you have to communicate within a distributed system or 
within a parallel system. IMPI was designed to let MPI communication 
layers interoperate in a distributed system. IMPI is suitable if you would 
like to execute a single parallel application (MPI-based) on a distributed 
system. The distributed system is seen as a virtual parallel machine. 
In our case, we would like to support coupled applications: a set of 
different parallel codes connected together. We defend the idea that an 
MPI layer is not suitable when communicating between distinct codes 
in a coupled application. In such a case, control and data have to 
be transferred between applications. MPI is designed to transfer data 
through message-passing but not the control. Moreover, in our approach 
we would like to have each parallel code associated with a description of 
its interface. IMPI does not provide an interface description language 
whereas CORBA IDL is a good candidate for such description. 
Abstract 

A Problem Solving Environment (PSE) is a complete, integrated com- 
puting environment for composing, compiling and running applications 
in a specific problem area or domain. We describe a visual code de- 
velopment tool within a PSE, which enables computational scientists 
to construct applications by connecting components. The granularity 
of each component can vary from being a complete code, to a mathe- 
matical routine such as a matrix or PDE solver. We first outline the 
requirements of such an environment, illustrating these with our imple- 
mentation. The implementation of a computational electro-magnetic 
solver is then described using this code development tool, based on a 
2D boundary element code. We emphasise lessons learned, and the 
importance of using such an environment to support new application 
development. 



Keywords: collaborative software, problem-solving environments, distributed com- 
puting, computational electromagnetics 
1. INTRODUCTION 

A Problem Solving Environment (PSE) is a complete, integrated com- 
puting environment for composing, compiling, and running applications 
in a specific area [10]. PSEs have been available for several years for cer- 
tain specific domains, but most of these have supported different phases 
of application development, and cannot be used cooperatively to improve 
a scientist’s productivity, primarily due to the lack of a framework 
for tool integration and ease-of-use considerations. 

The modern concept of a PSE for computational science [11] is based 
on the availability of high performance computing resources, coupled 
with specialised software tools and application specific knowledge. PSEs 
have the potential to greatly improve the productivity of scientists 
and engineers, particularly with the advent of web-based technologies, 
such as CORBA and Java, enabling access to remote computers and 
databases. 

The aim of our PSE is to provide the ability to build up scientific ap- 
plications by connecting or plugging software components together, and 
to provide an intuitive way to construct scientific applications. Hence, 
a PSE must contain: (1) application development tools that enable an 
end user to construct new applications, or integrate libraries from ex- 
isting applications, (2) development tools that enable the execution of 
the application on a set of resources. In this definition, a PSE must 
include resource management tools, in addition to application construc- 
tion tools, albeit in an integrated way. Based on the types of tools 
supported within a PSE, we can identify two types of users: (1) applica- 
tion scientists/engineers interested primarily in using the PSE to solve 
a particular problem (or domain of problems), (2) programmers and 
software vendors who contribute components to achieve the objectives 
of the category (1) users. The PSE infrastructure must support both 
types of users, and enable integration of third party products, in addi- 
tion to application specific libraries. In this paper we are concentrating 
on category (1) users. 

Our application interface to the PSE is called the Visual Compo- 
nent Composition Environment (VCCE). The VCCE contains a Pro- 
gram Composition Tool (PCT), which enables a user to construct sci- 
entific applications by combining components obtained from local or re- 
mote component repositories. All components have interfaces defined in 
XML, based on a PSE-wide data model. A project investigating the use 
of XML for defining component properties and interfaces is OSD [22], 
which supports ‘push’-based applications to automatically trigger the 
download of particular software components as new versions are developed. 
Hence, a component within a data flow may be automatically 
downloaded and installed, when a new or enhanced version of the com- 
ponent is created. This approach is linked to event handlers, with spe- 
cific events to identify when a new version of a particular component is 
available. The use of this description format is principally aimed at in- 
stalling new versions of existing components, and does not facilitate the 
discovery or description of properties of a given component. Another 
component description scheme is IBM’s BeanML [3], which enables a 
user to describe the properties of a component, using a specialised data 
model, and which is subsequently translated to Java code. This descrip- 
tion scheme is primarily based on developing graphical components, and 
primarily aimed at Java. The Koala project [16] at INRIA is aimed at 
providing an object markup language, to enable serialization of a Java 
object. It has primarily focused on developing graphics applications, and 
has not been used for encoding properties of objects, such as execution 
constraints associated with a given object, or groups of objects. There is 
also work by the OMG in creating a CORBA Component Model (CCM) 
in XML, enabling an XML description to be automatically translated 
into CORBA IDL [5], and vice versa. A “component” is defined as a 
new basic meta-type in CCM, enabling components to be defined by 
extensions to standard IDL, and be either ‘standard’ or ‘extended’ com- 
ponent types. A CORBA component interacts with a ‘Container’, and 
provides support for call backs and a Portable Object Adapter (POA). 
Our component model is more generic, supporting both data types in a 
particular language (such as Java), and also execution specific details 
to be added to a component description. Hence, a component in our 
system can be a wrapped code, or compiled Java bytecode, with con- 
straints defining what the code needs to run (such as whether it is an 
MPI code, and therefore requires MPI libraries to be available on the 
system), and constraints on the types of platforms on which the code 
can be run. Components are wrapped as binary codes, and the source 
code is not manipulated in any way - as in many instances, the source 
code is proprietary and not accessible. In the case of compound compo- 
nents, binary codes are integrated together, and the source code of the 
individual components is not involved. Component repositories must 
be statically connected to the PCT, prior to launching the VCCE. A 
user can also register new components with the local repositories which 
contain instances of local or remote components. The component model 
and a more detailed description of the VCCE architecture can be found 
in [14]. 
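
How execution constraints attached to a component description might be checked can be sketched as follows. The XML element and attribute names, the component name and the can_run helper are all invented for illustration; the actual PSE data model is the one described in [14].

```python
# Sketch of checking execution constraints from an XML component description.
# The element and attribute names below are invented for illustration; they
# are not the PSE's actual data model.
import xml.etree.ElementTree as ET

DESCRIPTOR = """
<component name="be2d-solver">
  <inport name="contour" type="ContourData"/>
  <outport name="rcs" type="RadarCrossSection"/>
  <constraints>
    <requires library="MPI"/>
    <platform os="Linux"/>
    <platform os="Solaris"/>
  </constraints>
</component>
"""


def can_run(descriptor_xml, host_os, host_libraries):
    """Return True if the host satisfies every constraint in the descriptor."""
    root = ET.fromstring(descriptor_xml)
    constraints = root.find("constraints")
    if constraints is None:
        return True
    required = {r.get("library") for r in constraints.findall("requires")}
    platforms = {p.get("os") for p in constraints.findall("platform")}
    libraries_ok = required <= set(host_libraries)
    platform_ok = not platforms or host_os in platforms
    return libraries_ok and platform_ok


ok = can_run(DESCRIPTOR, "Linux", ["MPI", "BLAS"])
missing_mpi = can_run(DESCRIPTOR, "Linux", ["BLAS"])
```

Because the check relies only on the description, it works for wrapped binary components whose source code is not accessible.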

Visual Programming is also an active research discipline, and various 
languages and tools have been developed within the community - a list 
of projects can be found at [28]. In the context of a PSE, various visual 
composition tools can be utilised; generally, a user can construct 
an application by combining “program blocks” with a particular func- 
tion, such as in AVS, Khoros and IRIS Explorer. In these systems the 
emphasis is on integrating blocks written in a particular programming 
language (such as C), or a particular scripting language (such as Scheme 
or Python). Visual programming tools to support parallel program con- 
struction have also been investigated, primarily as tools to combine lan- 
guage blocks from a particular parallel programming library, such as 
HeNCE [15] for PVM, to support specialised data partition for array 
based computations [2], or to enable the management and description 
of specialised data structures [4]. The visualisation in these latter envi- 
ronments has primarily been aimed at facilitating program development 
and integration from different modules. These environments are very 
specialised, and involve blocks containing low level descriptions of program 
statements that need to be combined to generate a single program. 
Higher level component composition environments also exist, which in- 
volve modules which can range from complete applications to specialised 
language statements - albeit for a specific language, and generally in 
the context of a particular application domain, such as CLEMENTINE 
(for data mining). Each of these environments supports constructs to 
combine ‘nodes’ into a data flow graph, and enable the construction 
either of new functionality, or of a complete application. These tools 
provide support for managing conditionals, loops and compound com- 
ponents to varying extents. In our system we borrow concepts of data 
flow based composition from some of these environments, but also en- 
able the description of specialised services, such as an ‘events’ service, 
which enables the execution of components either on a single machine, 
or a parallel computer. The event service interacts with a resource man- 
ager which can manage execution on a parallel machine. Our approach 
can therefore support binary components which support a specialised 
functionality (such as a mathematical library), or a complete applica- 
tion. We therefore borrow from existing work in visual programming 
languages, but provide support for checking component properties based 
on a system wide data model. Our approach is also more general, and 
can encapsulate approaches taken by systems specific to a given pro- 
gramming model, or to a specific application domain. We support the 
latter by providing specialised components that are specific to a given 
domain (such as components for reading from a structured database - 
for data mining [25]) and general purpose components for visualising the 
results of an experiment, and writing the results to a file, for instance. 
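
Data flow based composition can be reduced to a small sketch: components are nodes, connections are edges, and a component runs once the outputs it consumes are available. The component names and functions below are placeholders, and the sequential executor is an assumption made for brevity; it does not model the event service or the resource manager.

```python
# Sketch of data-flow composition: components are nodes, connections are edges,
# and a component runs once all of its inputs are available. Component names
# and functions are illustrative only.
from graphlib import TopologicalSorter  # Python 3.9+

components = {
    "mesh":  lambda inputs: [0.0, 1.0, 2.0],
    "scale": lambda inputs: [2 * x for x in inputs["mesh"]],
    "total": lambda inputs: sum(inputs["scale"]),
}

# Maps each component to the components whose outputs it consumes.
connections = {"mesh": set(), "scale": {"mesh"}, "total": {"scale"}}


def run_graph(components, connections):
    """Execute every component after the components it depends on."""
    results = {}
    for name in TopologicalSorter(connections).static_order():
        inputs = {src: results[src] for src in connections[name]}
        results[name] = components[name](inputs)
    return results


outputs = run_graph(components, connections)
```

A component here could equally be a complete application or a single library routine, since the executor only looks at connections, not at what a node does.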
1.1. RELATED WORK 

We outline some existing PSE projects, which have become popular, 
and employ some aspects of the infrastructure described previously: 

■ The Gateway project [9] introduces a component based system 
implemented using JavaBeans and utilising dataflow techniques to 
represent the application as a directed graph. The Gateway system 
chooses to use the Abstract Task Descriptor (ATD) as its lowest 
level of granularity of instruction and to build up the instructions 
that define the application. 

■ The Adaptive Distributed Virtual Computing Environment (AD- 
ViCE) project [13] is another system that provides a graphical 
user interface that enables a user to develop distributed applica- 
tions and specify the computing and communication requirements 
of each task within the task graph. Unlike the Gateway system, 
but similar to our own, the ADViCE system has its own scheduler 
that allocates tasks to resources at run time. 

■ The Arcade project [1] uses a slightly different approach in that the 
system has a three tier architecture, with the first tier consisting of 
a number of Java Applets that are used individually to specify the 
tasks (either visually or through a scripting language), to specify 
resource needs, and to provide monitoring and steering. Each of 
these Applets then interacts with a CORBA interface which in 
turn interacts with the final execution user modules distributed 
over a heterogeneous environment. 

■ SCIRun [17], [18] provides a programming environment to support 
interactive construction, de-bugging, and steering of large-scale 
scientific applications. The focus in SCIRun is on computational 
steering, supporting application, algorithm and performance steer- 
ing. 

■ The Distributed Problem Solving Environment Component Archi- 
tecture Toolkit (CAT) [19] is a component-based toolkit for in- 
tegrating heterogeneous software components. Aimed specifically 
at science and engineering, a CAT component can be dynamically 
inserted into the system and be made to interact with other CAT 
components, regardless of differences between architecture, operating 
system, and programming language. The end-user interacts 
with this PSE through a graphical interface, which provides a visual 
workspace in which components can be created and connected. 
Before the user can decide which components and machines to 
employ, she must have access to information about the hardware 
and software resources available on the system. This facility is 
provided by the CAT Resource Information Service (RIS). The 
RIS comprises an “Information Server” which maintains an LDAP 
database, and stores hardware and software meta-data, and an 
“Information Browser”, a graphical tool packaged with the CAT 
that allows a user to search and browse the contents of the LDAP 
database. 

■ The Netsolve project [20] enables the user to define problems in a 
specialised language, not dissimilar to Matlab. Interfaces are also 
provided for Fortran, C and Java. Netsolve also supports access 
to both hardware and software based computational resources dis- 
tributed across a network, supporting load-balancing and resource 
discovery using a collection of interacting agents. 

■ The Parallel ELLPACK project [21] is a PSE for PDE based appli- 
cations. Implemented using the ELLPACK language and sequen- 
tial solver libraries, it also contains finite element methods, third 
party solvers, and a graphical interface for problem specification. 
Support is also provided for running the generated application on 
parallel machines. 

Other projects which share features of a PSE, but do not provide both 
a program integration/generation tool and a resource manager include 
PARDIS [23], PAWS [24], and various resource management systems. 
Based on existing projects, a PSE must therefore: (1) allow a user to 
construct domain applications by plugging together independent components. 
Components may be written in different languages, placed at 
different locations, or exist on different platforms. Components may be 
created from scratch, or wrapped from legacy codes; (2) provide a vi- 
sual application construction environment; (3) support web-based task 
submission; (4) employ an Intelligent Resource Management System to 
schedule and efficiently run the constructed applications; (5) make good 
use of industry standards such as middleware (CORBA) and document tagging 
(XML); (6) be easy for users to extend within their domain. 

2. BOUNDARY ELEMENT CODE 

The boundary element code described in this paper is called be2d. 
The code is a two dimensional boundary element simulation code for 
the analysis of electro-magnetic wave scattering. The main inputs to 
the program are a closed two dimensional contour and a control file 
defining the characteristics of the incident wave. The contour file 
consists of a series of x,y coordinate pairs and is generated by a separate 
mesh generation program. The control file is a series of property values 
for the wave and consists of values for the wave frequency in Hertz, the 
wave direction in radians and a complex number representing the 
amplitude. For the computation of the matrix elements, the code uses a 
two dimensional formulation of Rao-Wilton-Glisson elements [26]. The 
outer integrations use one-point quadrature, while the inner integrations 
use two-point quadrature. A direct LU decomposition solver is used for 
computing the field. 

The code is written in Fortran and to run the original version, the user 
first runs the mesh generator from the command line. This produces the 
data file that represents the two dimensional contour. The user then has 
to run the be2d solver from the command line, ensuring that the contour 
data file and the wave control file are in the same directory. The solver 
produces two output files, one containing the radar cross section data 
and the other the surface current. 

In the version of the code that we use from within the PSE, various 
parts of the code and data generators are wrapped as CORBA objects. 
The components that make up the complete assembled code are: 

■ The mesh generator, for generating the ellipse curve which defines 
the example ‘model’ or geometry for the be2d program. This is the 
original mesh generator with a CORBA wrapper that generates a 
CORBA object representing the data set, instead of writing the 
data to a file. 

■ The wave control component that defines the characteristics of 
the incident wave: its frequency and angle. This is a simple object 
representation of the control file. This component has two outputs, 
the frequency of the wave and its angle. 

■ The be2d boundary element solver. This is the original solver, 
modified to accept input data from CORBA objects and which 
outputs a radar cross section as a CORBA object. 

■ The database component, for storing the output from the solver. 
This component was not part of the original code; it takes multiple 
output objects from the solver and stores them in a database. 

Each of these CORBA components is used through a CORBA server 
object, called the ActionFactory, based upon the Abstract Factory 
design pattern [12]. This pattern abstracts the object creation and 
execution details behind a factory interface. When the user selects a 
component for instantiation, the PSE connects to an instance of the be2d 
ActionFactory server. The factory object is then responsible for 
instantiating and executing that component. In this way the PSE does not 
need to know details about object instantiation or execution. The 
ActionFactory has a single method, exec(), which accepts a number of 
parameters, including the name of the process to execute, a set of input 
parameters and some details about the execution of the process, and 
returns a result set. 

A typical call to the ActionFactory would be: 

    short[] result = factory_.exec(action_name_, 
                                   numBddETag, 
                                   parameter, 
                                   fact_machine_name_, 
                                   fact_obj_name_, 
                                   fact_port_number_); 

where action_name_ represents the name of the command to be 
executed. For instance, ActionReadMesh is the action that generates the 
mesh object; its return value is a reference to a CORBA object 
representing the mesh. ActionComputeRCS is the action that executes 
the solver; it returns a reference to the CORBA object that represents 
the radar cross section result set. numBddETag and parameter are 
arrays representing the input data for this component. Each value is 
either a simple data type, as in the case of the wave frequency and angle, 
or a reference to a CORBA object, as in the case of the mesh. The action 
factory is responsible for delivering input datasets to the appropriate 
components to be executed to complete a given action. The final 
parameters, fact_machine_name_, fact_obj_name_ and fact_port_number_, are 
values that the ActionFactory uses to decide what component to execute 
and where to run it. All of these values are stored in the XML 
component definitions, which are parsed by the PSE and represented by a 
proxy component. Hence, running a particular component is achieved 
by a single function call to the ActionFactory instance, found at 
component instantiation time, passing in the values stored in the proxy 
together with any output parameters from the previous component in the 
graph. 
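
The factory-mediated call described above can be sketched in plain Java. This is an illustration only and not the be2d implementation: it leaves out CORBA entirely, the names ActionFactory (as a Java interface), InMemoryActionFactory and ComponentProxy are hypothetical, and the use of String arrays for the two input arguments is an assumption, since the text does not give their types. The short[] result type follows the call shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the factory interface: a single exec() method
// hides how an action is instantiated and where it is run.
interface ActionFactory {
    short[] exec(String actionName, String[] tags, String[] parameters,
                 String machineName, String objectName, int portNumber);
}

// In-memory factory used only for illustration. A real factory would
// dispatch to the wrapped legacy code on the named machine.
class InMemoryActionFactory implements ActionFactory {
    private final Map<String, Function<String[], short[]>> actions = new HashMap<>();

    void register(String actionName, Function<String[], short[]> action) {
        actions.put(actionName, action);
    }

    @Override
    public short[] exec(String actionName, String[] tags, String[] parameters,
                        String machineName, String objectName, int portNumber) {
        Function<String[], short[]> action = actions.get(actionName);
        if (action == null) {
            throw new IllegalArgumentException("Unknown action: " + actionName);
        }
        return action.apply(parameters);
    }
}

// Proxy holding the values that, according to the text, come from the
// XML component definition.
class ComponentProxy {
    private final ActionFactory factory;
    private final String actionName;
    private final String machineName;
    private final String objectName;
    private final int portNumber;

    ComponentProxy(ActionFactory factory, String actionName,
                   String machineName, String objectName, int portNumber) {
        this.factory = factory;
        this.actionName = actionName;
        this.machineName = machineName;
        this.objectName = objectName;
        this.portNumber = portNumber;
    }

    // Running a component is a single call on the factory.
    short[] execute(String[] tags, String[] parameters) {
        return factory.exec(actionName, tags, parameters,
                            machineName, objectName, portNumber);
    }
}
```

The point of the design is that the caller only ever sees exec(); which code runs, and on which host, is decided behind the interface.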

3. USING THE PSE 

When the PSE is started, the VCCE first checks in the component 
directory or directories for all defined components. The directories that 
the application examines are defined in an application meta-data file. 
The XML component definitions are parsed, and the proxy components are 
created and added to the component tree, ready to be selected by the 
user, as illustrated in Figure 1. 

Figure 1 VCCE with loaded component directory 

Figure 2 A constructed work flow graph 



To assemble a set of components into an executable task graph, the 
user simply selects a component from the tree with the mouse and 
clicks on the scratch pad on the right of the screen. This intuitive 
selection process has the same features as many visual programming or 
window-based environments, such as mouse-based selection and “drag 
and drop”. Once the desired components have been selected and placed 
on the scratch pad, they can be connected together using the connection 
menu button. The user clicks this button and then selects two components, 
parent first, then child; the PSE then establishes a data flow connection 
from the parent to the child. Repeating this process allows the user 
to connect the components together into a task graph. The final task 
the user has to perform before executing the graph is to assign a start 
node or nodes. A graph can have more than one start point if, for 
instance, there are two initial input generating components. The 
assembled and connected be2d graph is illustrated in Figure 2. To execute 
the completed graph, the user simply presses the start button to initiate 
the simulation. 
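
The assembly steps above amount to building a directed graph with one or more designated start nodes. A minimal sketch, assuming components are identified by name; the class TaskGraph and its methods are hypothetical and are not taken from the VCCE:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the scratch pad: components are nodes, and each
// data flow connection is a directed edge from parent to child.
class TaskGraph {
    private final Map<String, List<String>> children = new LinkedHashMap<>();
    private final Set<String> startNodes = new LinkedHashSet<>();

    // Placing a component on the scratch pad.
    void place(String component) {
        children.putIfAbsent(component, new ArrayList<>());
    }

    // Connecting two placed components, parent first, then child.
    void connect(String parent, String child) {
        requirePlaced(parent);
        requirePlaced(child);
        children.get(parent).add(child);
    }

    // Assigning a start node; a graph may have several.
    void markStart(String component) {
        requirePlaced(component);
        startNodes.add(component);
    }

    // In this sketch a graph can be executed once it has a start node.
    boolean isExecutable() {
        return !startNodes.isEmpty();
    }

    List<String> childrenOf(String component) {
        requirePlaced(component);
        return new ArrayList<>(children.get(component));
    }

    Set<String> startNodes() {
        return new LinkedHashSet<>(startNodes);
    }

    private void requirePlaced(String component) {
        if (!children.containsKey(component)) {
            throw new IllegalArgumentException("Not on scratch pad: " + component);
        }
    }
}
```

For the be2d example, the mesh generator and the wave control component would both be marked as start nodes, each connected to the solver.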

The solver is combined with a graph generator (JChart [27]) as a 
third-party component. The output generated from the code is illustrated 
in Figure 3. 

Figure 3 Solver Output 

One of the features of the PSE is the ability to perform 
iterations over components in a task graph. A more complex task graph 
containing two control components, in addition to the components needed 
for the solver, is illustrated in Figure 4. When these control components 
are connected to another component, the user is prompted with a 
selection of control input parameters that are suitable for iteration. In 
the case of be2d, the two input parameters to the wave component are the 
frequency and angle of incidence. Being floating point values, these are 
suitable for iteration, and there is a separate control component for each. 
The user can set the start, finish and increment values for each 
parameter. When the graph executes, the PSE will loop over the components, 
iterating the input parameters according to the user-defined values. 

4. COMPONENT INSTANTIATION AND 
GRAPH EXECUTION 

The be2d components used as the example in this paper are CORBA 
components. When the components are instantiated by the PSE upon 
selection by the user, the PSE has to establish a connection to the 
CORBA ORB that has a running instance of the be2d factory and get 
a reference to the factory from that ORB. The XML component 
definitions contain information about where to find the factory: the host 
machine, the port number and the name of the factory object, as well 
as the CORBA implementation, in this case Orbacus. The PSE automatically 
performs the connection to the ORB and retrieves a reference to the 
be2d ActionFactory when a proxy component is added to the scratch 
pad. When the proxy is told to execute by the PSE, it calls the exec() 
method on the instance of the factory to which it has a reference; see 
Section 2 on the be2d code for more specific details. 

Figure 4 A work flow graph with control components 

In more general terms when a task graph is executed, each of the 
components in the graph is told to execute in turn, starting with the 
component or components that have been identified by the user as the 
starting point for the graph. After a component has executed, it sends 
back an event to the VCCE to say that it has finished executing. At 
this point the VCCE will transfer the output parameters from the 
completed component to the input parameters of the next component to be 
executed and then call that component’s execute method. 
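
The execution scheme just described can be sketched as follows. This is a simplified, single-threaded sketch under stated assumptions: each component produces one numeric output, a component with several parents waits until every parent has delivered its output, and the "finished" event is modelled as a direct method call. The names Component and GraphRunner are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical component: runs once all parent outputs have arrived.
class Component {
    final String name;
    final Function<List<Double>, Double> body;
    final List<Component> children = new ArrayList<>();
    final List<Double> inputs = new ArrayList<>();
    int parentCount = 0;

    Component(String name, Function<List<Double>, Double> body) {
        this.name = name;
        this.body = body;
    }

    // Data flow connection from this component (parent) to a child.
    void connectTo(Component child) {
        children.add(child);
        child.parentCount++;
    }
}

// Hypothetical stand-in for the part of the VCCE that drives execution.
class GraphRunner {
    final List<String> finishedOrder = new ArrayList<>();

    // Execution begins at the start nodes chosen by the user.
    void run(List<Component> startNodes) {
        for (Component start : startNodes) {
            execute(start);
        }
    }

    private void execute(Component c) {
        double output = c.body.apply(c.inputs);
        onFinished(c, output); // models the event sent back to the VCCE
    }

    // Transfer the output to each child and run it once its inputs are complete.
    private void onFinished(Component done, double output) {
        finishedOrder.add(done.name);
        for (Component child : done.children) {
            child.inputs.add(output);
            if (child.inputs.size() == child.parentCount) {
                execute(child);
            }
        }
    }
}
```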

When the control components are introduced into a task graph, by 
connecting them to a suitable component, they provide the input to that 
component. In the example in this paper (see Figure 4), there 
are two control components: one to control the angle of the wave and 
the other to control the frequency. These control components work in a 
similar manner to the traditional “for... next” loop in most programming 
languages, stepping through a series of values from a start value until 
an end value is reached, incrementing by a set value. At each iteration of 
the loop the PSE checks that the halting condition has not been reached 
and then passes the current loop value to the connected component. In 
the case of the wave component, this value is simply passed straight 
through to its output parameter, where it becomes one of the input 
values for the solver. When, as in this example, there is more than 
one control component connected to a single component, the execution 
is equivalent to a nested loop and will continue until the 
halting condition on the outer loop is reached. The inner loop value 
is reset to its starting value every time the outer loop performs an 
iteration. 
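
The nested-loop behaviour of two control components can be sketched as below. The names ControlLoop and NestedIteration are hypothetical, and which parameter drives the outer loop is an assumption, since the text does not say whether frequency or angle is outermost. Loop values are computed from an integer counter so that rounding error does not accumulate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical control component: steps from start to end by a fixed increment.
class ControlLoop {
    final double start;
    final double end;
    final double step;

    ControlLoop(double start, double end, double step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        this.start = start;
        this.end = end;
        this.step = step;
    }

    // Values visited by the loop, in order; empty if end lies before start.
    List<Double> values() {
        List<Double> out = new ArrayList<>();
        // Small tolerance so that an end value hit exactly is included.
        int count = (int) Math.floor((end - start) / step + 1e-9);
        for (int i = 0; i <= count; i++) {
            out.add(start + i * step);
        }
        return out;
    }
}

class NestedIteration {
    // Returns {outer, inner} pairs in visit order. The inner loop restarts
    // from its first value for every value of the outer loop.
    static List<double[]> run(ControlLoop outer, ControlLoop inner) {
        List<double[]> visited = new ArrayList<>();
        for (double o : outer.values()) {
            for (double i : inner.values()) {
                visited.add(new double[] {o, i});
            }
        }
        return visited;
    }
}
```

With three frequency values and three angle values, the solver would be invoked nine times, once per pair.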



4.1. ERROR HANDLING 

Error handling in the PSE is undertaken by a Program Analysis Tool, 
which performs a set of checks on the components. It checks that: 



■ The data types for input and output ports on components are 
consistent. The first check involves ensuring that the syntax for the 
data types matches, based on the XML based component description 
provided. It then checks that the cardinality of the data types 
matches, and finally, whether the input/output is streamed into/out 
of a component, or whether it needs to be read/written from/to a 
file. 

■ A log file exists for recording execution errors, maintained local to 
the point where the component is being executed. Hence, regard- 
less of where the ActionFactory undertakes component execution, 
the log file is maintained at the same place where the execution is 
undertaken. 

■ Component constraints have not been violated by the component 
execution tool. These constraints can relate to the availability of 
specialised programming libraries (such as MPI or PVM), the 
existence of specialised operating systems (such as Solaris) or the 
availability of specialised system requirements, such as memory. 
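
Two of the checks listed above, port consistency and component constraints, can be sketched as follows. This is not the Program Analysis Tool itself: the names Port and ProgramAnalysis are hypothetical, type names are compared as plain strings, and the rule that both ends of a connection must agree on streamed versus file-based transfer is an assumption made for the sketch.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical port description, of the kind that might be read from an
// XML component definition.
class Port {
    final String type;       // data type name, compared textually
    final int cardinality;   // number of values carried
    final boolean streamed;  // true if streamed, false if passed via a file

    Port(String type, int cardinality, boolean streamed) {
        this.type = type;
        this.cardinality = cardinality;
        this.streamed = streamed;
    }
}

class ProgramAnalysis {
    // Checks an output port against the input port it feeds, in the order
    // given in the text: type, then cardinality, then transfer mode.
    // Returns a description of the first failed check, or empty if none fails.
    static Optional<String> portMismatch(Port output, Port input) {
        if (!output.type.equals(input.type)) {
            return Optional.of("data type mismatch");
        }
        if (output.cardinality != input.cardinality) {
            return Optional.of("cardinality mismatch");
        }
        if (output.streamed != input.streamed) {
            return Optional.of("transfer mode mismatch");
        }
        return Optional.empty();
    }

    // A component's constraints hold if the host offers everything required,
    // for example a library such as MPI or a particular operating system.
    static boolean constraintsSatisfied(Set<String> required, Set<String> available) {
        return available.containsAll(required);
    }
}
```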

Since the component source code is not modified, verification of 
component behaviour is not undertaken within the PSE. It is assumed that 
each component has been verified and operates according to the 
specifications provided by the component developer. A user can, however, 
place specialised components, such as a ‘loop’ component to undertake 
parameter tests, prior to using the component in an application. Results 
of these runs can be recorded into a file, and analysed for discrepancies 
between the specified output (by the component developer) and the 
expected output (by the component user). The output can also be analysed 
by specialised statistical components to undertake correlations between 
different experimental runs. 

5. SUMMARY AND CONCLUSIONS 

A PSE is aimed at supporting an application scientist in solving a 
problem within a given application domain. The “problem solving” 
process involves a range of activities from both a user’s point of view 
and a system’s point of view, the intention being to abstract details of 
software and hardware, which correspond to the system point of view, 
from the application scientist. The user’s point of view relates to being 
able to specify the problem in a decompositional manner, whereby an 
application is composed of interacting components, each of which 
undertakes a particular function. The data flow approach of combining 
components to compose applications is perhaps the most intuitive way 
to construct applications, and has been used by a range of other tools 
(such as AVS Explorer and Khoros); other approaches are a “declarative 
specification”, a “script based specification”, and a “high-level 
programming language based specification”. We feel the graphical approach 
adopted within our PSE is more generic, and can be used to extract a 
script in XML. We provide support for handling “conditionals” and 
“iterators” within the data flow approach, in addition to hierarchy to 
compose “compound” components. Additional details of these can be found 
in [7]. Wrapping legacy codes as components within a PSE is a 
non-trivial undertaking, particularly when making use of CORBA and Java 
based implementations. For instance, in order to support 
interoperability across programming languages, CORBA supports the minimal 
subset of data types across these. This could lead to data type 
incompatibility when wrapping Fortran or C codes, in terms of numerical 
precision and supported operations on particular data types such as 
complex numbers. The way in which legacy codes are wrapped can also 
affect the reusability of the resultant component. Wrapping the entire 
code as a single monolithic component is more straightforward, but 
smaller decomposed components may be reused more effectively [8]. 
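
As a small illustration of the data type issue, the sketch below flattens a complex value, such as the wave amplitude, into a pair of reals and rebuilds it on the other side. It is an assumption about how such values might be carried across an interface that offers only basic numeric types, not a description of how be2d was wrapped; the names ComplexValue, marshal and unmarshal are hypothetical, and no CORBA API is used.

```java
// Hypothetical complex value. Fortran has a native COMPLEX type, whereas an
// interface limited to basic numeric types has to carry it as two reals.
final class ComplexValue {
    final double re;
    final double im;

    ComplexValue(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Flatten to {re, im} for transport.
    double[] marshal() {
        return new double[] {re, im};
    }

    // Rebuild the value on the receiving side.
    static ComplexValue unmarshal(double[] wire) {
        if (wire == null || wire.length != 2) {
            throw new IllegalArgumentException("expected exactly two values");
        }
        return new ComplexValue(wire[0], wire[1]);
    }

    // Narrowing to single precision, included to make the loss of
    // precision visible when the two sides disagree on the real type.
    float[] marshalSinglePrecision() {
        return new float[] {(float) re, (float) im};
    }
}
```

Operations on the value (multiplication, conjugation and so on) are not carried by such a representation and would have to be supplied again on each side.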

The VCCE simplifies the process of running a complex scientific code, 
using the intuitive visual programming paradigm. The application scien- 
tist does not need to configure software components, and can concentrate 
on undertaking parameter runs or visualising output from a solver. Var- 
ious components are provided to achieve some of these functions, which 
may be local to the scientist or refer to components held at other sites. 

Although there is an obvious overhead in having a legacy code 
wrapped as a CORBA object, the cost is not as great as it might appear. 
We have undertaken performance comparisons of wrapped legacy codes 
on both workstation clusters [8] and dedicated parallel machines [6]. 
The most time consuming part of using a CORBA object is the initial 
“handshaking” with the ORB. The VCCE performs the CORBA con- 
nection at component instantiation and not execution time. The user 
is still performing the process of building the graph at this time, so the 
cost in time is not really noticed. Once the graph comes to execution, 
the CORBA connections are already in place and the speed of execution 
is not affected by a discernible amount compared to the original code 
executed via the command line. 

DISCUSSION 

Speaker: David Walker 

Masaaki Shimasaki : Does VCCE provide the functionality in com- 
position of software (i.e., network programming) similar to that of the 
AVS system for visualization? 

Richard Fateman : Can you comment on the extent to which the 
PSE/visual programming environment design has drawn on the decades 
of experience in the programming language design community, in partic- 
ular, solutions to issues of functional composition, information hiding, 
name spaces, scope, exception handling? 

David Walker : Visual Programming is an active research discipline, 
and various languages and tools have been developed within the 
community - a list of projects can be found at http://cui.unige.ch/Visual/. 
In the context of a PSE, various visual composition tools can be utilised; 
generally a user can construct an application by combining “program 
blocks” with a particular function, such as in AVS, Khoros and IRIS 
Explorer. In these systems the emphasis is on integrating blocks written 
in a particular programming language (such as C) , or a particular script- 
ing language (such as Scheme or Python). Visual programming tools to 
support parallel program construction have also been investigated, pri- 
marily as tools to combine language blocks from a particular parallel 
programming library, such as HeNCE for PVM, to support specialised 
data partition for array based computations, or to enable the manage- 
ment and description of specialised data structures. The visualisation in 
these latter environments has primarily been aimed at facilitating pro- 
gram development and integration from different modules. These envi- 
ronments are very specialised, and involve blocks containing low level 
descriptions of program statements that need to be combined to 
generate a single program. Higher level component composition environments 
also exist, which involve modules which can range from complete applica- 
tions to specialised language statements - albeit for a specific language, 
and generally in the context of a particular application domain, such 
as CLEMENTINE (for data mining). Each of these environments sup- 
ports constructs to combine ‘nodes’ into a data flow graph, and enable 
the construction either of new functionality, or of a complete applica- 
tion. These tools provide support for managing conditionals, loops and 
compound components to varying extents. In our system we borrow 
concepts of data flow based composition from some of these environ- 
ments, but also enable the description of specialised services, such as 
an ‘events’ service, which enables the execution of components either 
on a single machine, or a parallel computer. The event service 
interacts with a resource manager which can manage execution on a parallel 
machine. Our approach can therefore support binary components which 
support a specialised functionality (such as a mathematical library), or a 
complete application. We therefore borrow from existing work in visual 
programming languages, but provide support for checking component 
properties based on a system wide data model. Our approach is also 
more general, and can encapsulate approaches taken by systems specific 
to a given programming model, or to a specific application domain. We 
support the latter by providing specialised components that are specific 
to a given domain (such as components for reading from a structured 
database - for data mining) and general purpose components for visu- 
alising the results of an experiment, and writing the results to a file, for 
instance. 

Anne Trefethen : When combining a set of modules into a single 
hierarchical module are you creating a single interface to those modules 
or are you combining the modules by manipulating their source code in 
some way? 

Bruce Char : Are there any major efforts in describing components 
(through XML or other means) outside of the scientific computation 
community that problem-solving environment builders should be aware 
of? How close are we to having component description standards that 
would allow components such as commercial document processing or 
databases to be incorporated in this kind of PSE architecture? 

David Walker : A project investigating the use of XML for defin- 
ing component properties and interfaces is OSD, which supports ‘push’- 
based applications to automatically trigger the download of particular 
software components as new versions are developed. Hence, a compo- 
nent within a data flow may be automatically downloaded and installed, 
when a new or enhanced version of the component is created. This ap- 
proach is linked to event handlers, with specific events to identify when 
a new version of a particular component is available. The use of this 
description format is principally aimed at installing new versions of ex- 
isting components, and does not facilitate the discovery or description of 
properties of a given component. Another component description scheme 
is IBM’s BeanML, which enables a user to describe the properties of a 
component, using a specialised data model, and which is subsequently 
translated to Java code. This description scheme is primarily based on 
developing graphical components, and primarily aimed at Java. The 
Koala project at INRIA is aimed at providing an object markup 
language, to enable serialization of a Java object. It has primarily focused 
on developing graphics applications, and has not been used for 
encoding properties of objects, such as execution constraints associated with 
a given object, or groups of objects. There is also work by the OMG 
in creating a CORBA Component Model (CCM) in XML, enabling an 
XML description to be automatically translated into CORBA IDL, and 
vice versa. A “component” is defined as a new basic meta-type in CCM, 
enabling components to be defined by extensions to standard IDL, and 
be either ‘standard’ or ‘extended’ component types. A CORBA 
component interacts with a ‘Container’, and provides support for callbacks 
and a Portable Object Adapter (POA). Our component model is more 
generic, supporting both data types in a particular language (such as 
Java), and also execution specific details to be added to a component de- 
scription. Hence, a component in our system can be a wrapped code, or 
compiled Java bytecode, with constraints defining what the code needs 
to run (such as whether it is an MPI code, and therefore requires MPI 
libraries to be available on the system), and constraints on the types of 
platforms on which the code can be run. Components are wrapped as 
binary codes, and the source code is not manipulated in any way - as 
in many instances, the source code is proprietary and not accessible. In 
the case of compound components, binary codes are integrated together, 
and the source code of the individual components is not involved. Com- 
ponent repositories must be statically connected to the PCT, prior to 
launching the VCCE. A user can also register new components with the 
local repositories which contain instances of local or remote components. 

Richard Fateman : It would be useful for designers to be conversant 
with issues of functional programming, functions as first-class objects, 
lexical scope. One reference would be the text “Structure and 
Interpretation of Computer Programs” by H. Abelson and G. Sussman, 
McGraw-Hill/MIT Press. 

David Walker : We agree that functional programming languages 
provide an elegant way of describing composition, and in fact have been the 
basis of other work, such as “algorithmic skeletons”. The functional 
programming paradigm is, however, difficult to grasp for many non-computer 
scientists, and there is little suggestion that functional programming 
languages, such as Haskell, Miranda and Hope, have achieved adoption 
comparable to that of C or Java. Our emphasis is on visual programming, 
whereby applications can be visually constructed from blocks. There 
is a possibility, however, of translating the visual representation into a 
functional language, and a translator can be written to achieve this. 

Mladen Vouk : Errors committed by users during specification and 
solution engineering using problem-solving environments can be very 
costly. Standard software engineering teaches us that verification and 
validation (V&V) should be an integral part of the specification 
process (this includes domain-specific, math-specific, and 
environment-predicated V&V). Your system does not appear to have any explicit 
V&V hooks, tools, or process points. Why? Do you plan to incorporate 
them into the system in the future? 

David Walker : Error handling in the PSE is undertaken by a Program 
Analysis Tool, which performs a set of checks on the components. It 
checks that: 

■ The data types for input and output ports on components are 
consistent. The first check involves ensuring that the syntax for the 
data types matches, based on the XML based component description 
provided. It then checks that the cardinality of the data types 
matches, and finally, whether the input/output is streamed into/out 
of a component, or whether it needs to be read/written from/to a 
file. 

■ A log file exists for recording execution errors, maintained local to 
the point where the component is being executed. Hence, regard- 
less of where the ActionFactory undertakes component execution, 
the log file is maintained at the same place where the execution is 
undertaken. 

■ Component constraints have not been violated by the component 
execution tool. These constraints can relate to the availability of 
specialised programming libraries (such as MPI or PVM), the 
existence of specialised operating systems (such as Solaris) or the 
availability of specialised system requirements, such as memory. 

Since the component source code is not modified, verification of 
component behaviour is not undertaken within the PSE. It is assumed that 
each component has been verified and operates according to the 
specifications provided by the component developer. A user can, however, 
place specialised components, such as a ‘loop’ component to undertake 
parameter tests, prior to using the component in an application. Results 
of these runs can be recorded into a file, and analysed for discrepancies 
between the specified output (by the component developer) and the 
expected output (by the component user). The output can also be analysed 
by specialised statistical components to undertake correlations between 
different experimental runs. 

Masaaki Shimasaki : In your development of problem-solving envi- 
ronments, are there any specific design features that relate directly to 
electromagnetics computation or to the specific needs of end users? 

David Walker : Our PSE contains specialised components used for 
supporting electromagnetic applications. These include: 

■ The mesh generator, for generating the ellipse curve which defines 
the example ‘model’ or geometry for the be2d program. This is the 
original mesh generator with a CORBA wrapper that generates a 
CORBA object representing the data set, instead of writing the 
data to a file. 

■ The wave control component that defines the characteristics of 
the incident wave: its frequency and angle. This is a simple object 
representation of the control file. This component has two outputs, 
the frequency of the wave and its angle. 

■ The be2d boundary element solver. This is the original solver, 
modified to accept input data from CORBA objects and which 
outputs a radar cross section as a CORBA object. 

■ The database component, for storing the output from the solver. 
This component was not part of the original code; it takes multiple 
output objects from the solver and stores them in a database. 

Each of these CORBA components is used through a CORBA server ob- 
ject, called the ActionFactory, based upon the Abstract Factory design 
pattern. 

Abstract A distinguishing feature of scientific computing is the necessity to 
design software abstractions for approximations. The approximations are 
themselves abstractions of mathematical models, and the mathematical 
models are also abstractions of the real world. 

In this paper, the relation between different mathematical 
abstraction levels and scientific computing software is discussed, in 
particular with respect to the simulation of partial differential equations. 
By applying software engineering practices as early as the stage where the 
mathematical model is considered, a coordinate-free formulation is readily 
motivated. 

For partial differential equations, the continuous layer is identified as 
separate from the coordinate-free layer. By mapping these layers 
into different software modules, numerical discretization carried out in 
the continuous layer is cleanly separated from the coordinate-free layer. 
This separation of concerns increases modularity because reusability is 
promoted. 

It is therefore concluded that the continuous and coordinate-free ab- 
straction layers provide a solid foundation for software that simulates 
partial differential equations. 

Keywords: Abstraction, coordinate-free numerics, object-oriented modeling, PDE 
simulation. 



1. INTRODUCTION 

Applications in the area of scientific computing span a wide range 
of problem domains, and each domain exhibits its own challenges to be 
addressed. However, one distinctive feature of scientific computing 
applications is the need to deal with approximations of continuous structures. 
The obvious example is that the infinite number of reals on any interval 
must be approximated on a computer, normally by floating point num- 
bers of appropriate precision. But even this basic example shows that 
scientific computing is a science where nothing is certain, since it is also 
possible to use, say, integers to represent real values, as is often done 
in image processing. Another example of a continuous approximation 
is the abstraction of an angle. In order to become a useful component, 
this abstraction must recognize that different real numbers, say π and 
3π, may represent the same angle, and also that different units such as 
radians or degrees may be used. These facts are of a different nature 
than the fact that the same real number may be approximated by dif- 
ferent discrete representations. The understanding of abstractions and 
their approximations is a key aspect for scientific computing, in order to 
obtain modular software architectures. 

In this paper, we promote the philosophy that the design of applica- 
tions for scientific computing should be based firstly upon continuous 
abstractions and secondly on approximations. This is nicely illustrated 
by the angle abstraction having some continuous features, independent 
of a particular choice of real number approximation. An angle abstrac- 
tion should also be general with respect to the choice of units. The 
same observation applies to physical quantities. A velocity vector, for 
instance, should be independent of any particular unit or any particular 
coordinate system. Consequently, we identify two different mathematical 
abstraction levels, a continuous abstraction level and a coordinate-free 
abstraction level. 
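
The angle abstraction discussed above can be made concrete with a short sketch. It is an illustration only and is not taken from the Sophus project: the class name Angle and its methods are hypothetical, and double is just one possible choice of real number approximation. The value is stored once in a normalised form, so that real numbers such as π and 3π denote the same angle, and the unit is chosen only at the boundary.

```java
// Hypothetical angle abstraction: stored in radians, normalised to [0, 2*pi).
final class Angle {
    private static final double TWO_PI = 2.0 * Math.PI;
    private final double radians;

    private Angle(double radians) {
        double r = radians % TWO_PI; // result keeps the sign of the argument
        if (r < 0) {
            r += TWO_PI;
        }
        this.radians = r;
    }

    // The unit is a property of construction and access, not of the angle.
    static Angle ofRadians(double value) {
        return new Angle(value);
    }

    static Angle ofDegrees(double value) {
        return new Angle(Math.toRadians(value));
    }

    double radians() {
        return radians;
    }

    double degrees() {
        return Math.toDegrees(radians);
    }

    // Two angles are the same if their normalised values agree within a
    // tolerance, also across the wrap-around at 2*pi.
    boolean sameAs(Angle other, double tolerance) {
        double diff = Math.abs(radians - other.radians);
        return Math.min(diff, TWO_PI - diff) <= tolerance;
    }
}
```

The comparison takes an explicit tolerance because the continuous notion of "the same angle" is only approximated by floating point values.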

The aim of this paper is to discuss the role mathematical abstractions 
have concerning the modularity of the software architecture. We focus 
on the simulation of partial differential equations (PDEs). The problem 
domain of PDE solvers is demanding, since it includes the need to eval- 
uate approximate solutions of mathematical models, a request for high 
performance, and the necessity to utilize physically motivated simplifica- 
tions where possible. In the Sophus project [9, 11] we find that a design 
which is based upon continuous structures is more modular than a de- 
sign based upon discrete approximations of the continuous structures. A 
software architecture based on coordinate-free abstractions — coordinate- 
free numerics [15] — has additional flexibility. Here, PDEs can be written 
independent of the number of space dimensions and the choice of coor- 
dinate system, two properties that often change in applications. 

There are many related initiatives in the scientific computing com- 
munity that address software abstractions for PDE solvers, for in- 
stance [1, 5, 6, 20]. The emphasis on continuous abstractions is also 
noted in [16] regarding the modeling of computational geometries. For 
geometric integration [12] of ordinary differential equations, Diffman provides 
a coordinate-free software package [7]. Coordinate-free numerical 
optimization is also emphasized in [8]. To some extent, coordinate-free 
differentiation of PDEs is also addressed in Overture [4] concerning com- 
posite curvilinear grids. 

The remainder of this paper is organized as follows. In Section 2 we 
discuss some mathematical abstractions and introduce the notion of dif- 
ferent mathematical abstraction levels, giving examples from the area of 
PDEs. In Section 3 we note a few implications for software architecture. 
Our concluding section points out that coordinate-free mathematical 
concepts are essential for the modularity of PDE simulating software. 

2. MATHEMATICAL ABSTRACTIONS 

In software engineering, a number of object-oriented methodologies 
have been advocated in order to obtain software systems with high re- 
liability, robustness, modularity, understandability, and so forth. See 
for example [3, 13, 17]. Some of the most important principles used in 
object-oriented modeling are information hiding, abstraction, polymorphism, 
and aggregation. Moreover, as a general guideline, the software 
is to be based on concepts in the real world that we want to model. 

In the context of PDE solvers, these principles imply that the software 
abstractions must come from the mathematical domain of PDEs. This 
is recognized by all relevant software projects, including those previously 
mentioned in the introduction. However, we believe that the modular- 
ity of the final application is highly dependent on which mathematical 
abstractions one chooses as input for the design. In order to choose ap- 
propriate mathematical abstractions, we think that software engineering 
practices must be applied already when modeling the mathematical do- 
main. In this section we illustrate this idea and we motivate coordinate- 
free numerics by pure continuous considerations, see Section 2.4. 

As a first step we divide our problem domain of PDEs into differ- 
ent layers and subsystems. These partitions should ideally correspond 
to software modules, aiming for a good modularity. We identify the 
following levels: 

1 Manifold abstractions 

2 Functions on manifolds and partial derivatives 

3 Coordinate-free tensor abstractions 

4 Equation abstractions 

Below, we discuss these levels further. Our presentation will recapitulate 
some basic mathematics for PDEs, mostly from a coordinate-free point 
of view. For a more thorough introduction, we recommend [19]. 

2.1. MANIFOLD ABSTRACTIONS 

The manifold is a central concept for PDEs. The manifold is the 
space where the entities of the PDE are defined and where the solution 
of the PDE evolves. Examples of manifolds are three-dimensional space, 
four-dimensional space-time, phase-space, or the sphere. We want to 
study the common continuous properties of manifolds, because, as Schutz 
notes [19, p. 23]: “All these spaces have different geometrical properties, 
but they all share something in common, something which has to do 
with their being continuous spaces rather than, say, lattices of discrete 
points.” 

Mathematically, a manifold M with dimension n is defined as a set 
of points where each point of M has an open neighborhood which has 
a continuous 1-1 map onto an open set of $\mathbb{R}^n$. Thus, every manifold is 
locally ‘like’ $\mathbb{R}^n$. On a very “primitive” manifold, the notion of distance is 
not needed, but we will assume that we deal with Riemannian manifolds 
with a metric. 

2.2. MANIFOLD FUNCTIONS AND 
PARTIAL DERIVATIVES 

A PDE is a relation between various tensor fields and their derivatives. 
In this section, we first discuss the auxiliary concept of a “manifold 
function”. We then discuss partial derivatives. 

■ A manifold function ManifoldFcn<T> is parametrized over a type 
T. It is defined as having a value of type T in every point of a 
particular manifold. It can be thought of as a continuous array 
with the same dimension as the manifold. The type T must have 
all the usual arithmetic operations, addition, multiplication etc. 
These operations may then be “lifted” pointwise from type T to 
type ManifoldFcn<T>. This corresponds to a significant raise 
of abstraction level, allowing us to simultaneously manipulate all 
values in a ManifoldFcn<T> object. 

■ In order to treat partial derivatives, we introduce PD scalar field, a 
scalar field which only has partial derivatives. It can be thought of 
as ManifoldFcn<Scalar> with additional operations for partial 
differentiation. Thus, the first category of PD scalar field operations 
is the same as for manifold functions, obtained by lifting 
arithmetic operations on scalars. The second category, the differentiation 
operations, studies how the scalar field changes between 
different points. PD scalar field is illustrated in the lower part of 
Figure 1 using UML [18]. We have exemplified arithmetic operations 
with the + operator, and we denote by ddx the partial derivative 
operation, $\partial/\partial x_i$. 

Figure 1 Relations between different abstractions for PDEs. The PD scalar field 
consists of several scalars, one for each point in the manifold. A tensor has several 
scalar components. For a tensor field tf, we require tf[p][c] = tf[c][p], and a 
tensor field can be seen either as having a tensor in each point or a scalar field for 
every component. 
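
The pointwise lifting of arithmetic from T to ManifoldFcn<T> can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the manifold function is represented by a callable from points to values, points are plain tuples, and the names value_at, f, g and h are hypothetical.

```python
from typing import Callable, Generic, TypeVar

P = TypeVar("P")  # point type of the manifold (an assumption of this sketch)
T = TypeVar("T")  # value type; must itself support + and *


class ManifoldFcn(Generic[P, T]):
    """A value of type T in every point of a manifold, stored as a callable."""

    def __init__(self, value_at: Callable[[P], T]):
        self._value_at = value_at

    def __call__(self, point: P) -> T:
        return self._value_at(point)

    def __add__(self, other: "ManifoldFcn[P, T]") -> "ManifoldFcn[P, T]":
        # + on T is lifted pointwise to + on ManifoldFcn<T>.
        return ManifoldFcn(lambda p: self(p) + other(p))

    def __mul__(self, other: "ManifoldFcn[P, T]") -> "ManifoldFcn[P, T]":
        # * on T is lifted pointwise in the same way.
        return ManifoldFcn(lambda p: self(p) * other(p))


# Two scalar-valued manifold functions on a two-dimensional manifold.
f = ManifoldFcn(lambda p: p[0] + p[1])
g = ManifoldFcn(lambda p: p[0] * p[1])

# One expression manipulates all values of f and g simultaneously.
h = f + g * g
```

The expression f + g * g never mentions individual points; a value is produced only when h is evaluated at a point, for instance h((1.0, 2.0)).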



2.3. COORDINATE-FREE ABSTRACTIONS 

The principle that the underlying physical laws shall be independent 
of the choice of coordinate system is fundamental for our understanding 
of the world. Therefore, it is beneficial to use coordinate-free formula- 
tions of the laws. Such formulations are based on tensors and derivatives 
of tensors. 

Briefly, a tensor of order n in a point p of a manifold with dimension d 
can be represented with $d^n$ components. The basic arithmetic operations 
must hold for its components. As for manifold functions, arithmetic 
operations may be component-wise lifted to tensors. Moreover, a tensor 
may be applied to another tensor, yielding a third tensor. Tensors can 
therefore be regarded as multi-linear mappings between tensor spaces. 
See equation (1.2) below for an example. Familiar examples of tensors 
are vectors, which have order 1, and scalars, which have order 0. 
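
To make the component count and the notion of applying one tensor to another concrete, the following Python sketch stores a tensor of order n on a d-dimensional manifold as a dictionary with $d^n$ entries. The class Tensor, its constructor and the method apply are hypothetical names for this illustration, and the components are assumed to refer to one fixed basis.

```python
from itertools import product


class Tensor:
    """Order-n tensor in one point of a d-dimensional manifold (illustrative only)."""

    def __init__(self, order: int, dim: int, components: dict):
        self.order, self.dim = order, dim
        # One component per multi-index (i1, ..., in); missing entries are zero.
        self.components = {
            idx: components.get(idx, 0.0)
            for idx in product(range(dim), repeat=order)
        }

    def __add__(self, other: "Tensor") -> "Tensor":
        # Component-wise lifting of + from scalars to tensors.
        assert (self.order, self.dim) == (other.order, other.dim)
        return Tensor(self.order, self.dim,
                      {i: v + other.components[i]
                       for i, v in self.components.items()})

    def apply(self, arg: "Tensor") -> "Tensor":
        # Contract the trailing arg.order indices of self with arg,
        # yielding a tensor of order self.order - arg.order.
        out_order = self.order - arg.order
        out = {}
        for i in product(range(self.dim), repeat=out_order):
            out[i] = sum(self.components[i + j] * arg.components[j]
                         for j in product(range(self.dim), repeat=arg.order))
        return Tensor(out_order, self.dim, out)


# An order-4 tensor acting as the identity on order-2 tensors (d = 2).
lam = Tensor(4, 2, {(i, j, i, j): 1.0 for i in range(2) for j in range(2)})
e = Tensor(2, 2, {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0})
sigma = lam.apply(e)
```

Here lam holds 2^4 = 16 components, and applying it to the order-2 tensor e yields a third tensor sigma of order 2, which is the pattern of equation (1.2).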

A tensor field is like a Manifold Fcn<Tensor> with additional oper- 
ations for coordinate-free derivation. The operations of interest in the 
tensor field interface come from the same categories as for the scalar field: 
the lifting of operations from the tensor abstraction, and derivatives. In 
the top of Figure 1, we have listed some coordinate-free derivatives in the 
Tensor field interface. A Lie derivative has as a parameter a tensor field 
of order 1, a vector field. We have also listed div = $\nabla\cdot$, the divergence. 
In order to compute the coordinate-free derivatives, the tensor field uses 
the metric tensor which holds information about the coordinate system. 

Regarding the lifting of arithmetic operations, we note in Figure 1 
that we can lift for instance the + operator either from the tensor or 
from the scalar field. These results must be identical. 



2.4. EQUATION PACKAGES 



Above, we have concentrated on mathematical definitions of basic 
mathematical concepts. Here, we compare the coordinate-free approach 
with a component dependent approach. 

As a motivating example, we use the wave equation for an elastic 
medium. A typical industrial application of the wave equation is to 
simulate seismic waves in oil reservoirs [2]. It is a standard equation 
from mathematical physics, and may be found in any text book on the 
subject, for instance [14]. In its most compact version, the coordinate- 
free form, it may be stated as: 



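$$\rho \frac{\partial^2 u}{\partial t^2} = \nabla \cdot \sigma + f(t) \qquad (1.1)$$

$$\sigma = \Lambda(e) \qquad (1.2)$$

$$e = \tfrac{1}{2}\,\mathcal{L}_u(g) \qquad (1.3)$$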



Here, $\rho$ is a scalar field, $u$ is the displacement vector to be simulated, $f$ 
is a time-dependent forcing function, and $\sigma$, $e$, $\Lambda$ and $g$ are tensors of order 
two ($\sigma$, $e$ and $g$) and four ($\Lambda$). The equations also involve derivation 
with respect to time, the divergence $\nabla\cdot$ and the Lie derivative $\mathcal{L}$. 
In order to interpret the equations, we may assume Cartesian coordinates, 
and the equations may then be reformulated in component form 
as: 



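$$\rho \frac{\partial^2 u_i}{\partial t^2} = \sum_k \frac{\partial \sigma_{ik}}{\partial x_k} + f_i \qquad (2.1)$$

$$\sigma_{ij} = \sum_{k,l} \Lambda_{ijkl}\, e_{kl} \qquad (2.2)$$

$$e_{ij} = \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right) \qquad (2.3)$$

Here, the component indices i and j and the summation indices k and l 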
vary over the number of space dimensions. 

Sometimes, the model we simulate is axi-symmetric. When simulat- 
ing oil reservoirs, this is a typical assumption in the vicinity of a bore 
hole. The problem may then be simplified in cylindrical coordinates by 
observing that $u(r, \theta, z)$ is independent of $\theta$. The change of coordinate 
system has implications for the metric tensor and thus the derivatives 
involved. The coordinate-free formulation remains intact, but the component 
form must be reformulated. Introducing auxiliary functions $d_i$ 
and $L_{ij}$ for the derivatives in cylindrical coordinates, we write 



$$\rho \frac{\partial^2 u_i}{\partial t^2} = d_i + f_i \qquad (3.1)$$

$$\sigma_{ij} = \sum_{k,l} \Lambda_{ijkl}\, e_{kl} \qquad (3.2)$$

$$e_{ij} = L_{ij} \qquad (3.3)$$

To keep the presentation simple, we do not present explicit formulas for 
the derivatives. The point is that the formulas change. One can derive, 
for example, that $L_{\theta\theta}$ in this case is given by $u_r/r$, and not by $\partial u_\theta/\partial \theta$, 
as equation (2.3) suggests. 

The fact that the coordinate-free formulation remains intact is the 
key motivation for the coordinate-free approach. We believe the invariance 
of (1) to be significant. We illustrate the situation in Figures 2 
and 3. These UML diagrams depict two tentative UML models of the 
wave equation. We believe that already in this early stage of the de- 
sign process, the two models have significant differences with respect to 
modularity. 

Figure 2 illustrates a coordinate-free mathematical model. We model 
the wave equation (1) as an aggregate of its three equations. Each indi- 
vidual equation is expressed using tensors. These tensors are associated 
with the metric of the manifold, needed for computing differentiation 
expressions. 

Figure 3 illustrates the component equations (2) and (3). The main 
difference is that a coordinate dependent interface is used to express 
the equations, and the equations are explicitly formulated using partial 
derivatives. In order to represent both Cartesian coordinates and 
cylindrical coordinates, we are now forced to have different versions 
for equations in different coordinate systems. Even if the equation 
$\sigma_{ij} = \sum_{k,l} \Lambda_{ijkl}\, e_{kl}$ is invariant under a change of coordinate system, 
we maintain that the dependence of equations on the coordinate system 
reflects a poor model, in the sense that it is sensitive to changes. The 
coordinate-free diagram, on the other hand, represents a more robust 
model. This model is better suited as input for software design. 

Figure 2 A coordinate-free mathematical model is based upon equations that use a 
coordinate-free interface to tensor fields. A change of coordinates affects the metric 
tensor, but the equations remain untouched. 
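
The difference between the two models can be illustrated with a small Python sketch in which a toy "equation" is written once against a coordinate-free operation (the divergence), while the coordinate-dependent formulas live in separate classes. All names (VectorField2D, CartesianField, AxisymmetricField, source_term) and the central-difference helper are hypothetical and serve only this example; the axisymmetric class assumes cylindrical coordinates (r, z) with no dependence on the angle.

```python
H = 1e-5  # step of the illustrative central difference


def partial(f, axis, p):
    """Central-difference approximation of the partial derivative of f at p."""
    lo, hi = list(p), list(p)
    lo[axis] -= H
    hi[axis] += H
    return (f(hi) - f(lo)) / (2 * H)


class VectorField2D:
    """Coordinate-free interface: equations only ever call div()."""

    def __init__(self, comp0, comp1):
        self.c = (comp0, comp1)

    def div(self, p):
        raise NotImplementedError


class CartesianField(VectorField2D):
    # Coordinates (x, y): div u = du_x/dx + du_y/dy.
    def div(self, p):
        return partial(self.c[0], 0, p) + partial(self.c[1], 1, p)


class AxisymmetricField(VectorField2D):
    # Coordinates (r, z): div u = (1/r) d(r u_r)/dr + du_z/dz.
    def div(self, p):
        r_ur = lambda q: q[0] * self.c[0](q)
        return partial(r_ur, 0, p) / p[0] + partial(self.c[1], 1, p)


def source_term(u, p):
    """A toy equation, written once against the coordinate-free interface."""
    return u.div(p)


u_r = lambda p: p[0]   # radial component u_r = r
u_z = lambda p: 0.0    # axial component u_z = 0
axisymmetric = source_term(AxisymmetricField(u_r, u_z), (2.0, 1.0))
naive = source_term(CartesianField(u_r, u_z), (2.0, 1.0))
```

The function source_term is untouched by the change of coordinates, whereas applying the Cartesian formula to the cylindrical components (the value naive) gives a different and incorrect divergence. This is the sensitivity that the component-based model of Figure 3 builds into the equations themselves.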

3. IMPLICATIONS FOR THE SOFTWARE 
ARCHITECTURE 

In earlier sections, we have discussed modeling of the PDE domain, 
from a continuous perspective. We have shown the importance of soft- 
ware engineering concepts already at this early stage of the design pro- 
cess. In this section, we discuss the transition from the continuous model 
to a discrete model, necessary in order to simulate PDEs on a computer. 
The final software design is of course dependent on the choice of dis- 
cretization techniques. We believe, however, that if the design is based 
on the continuous and coordinate-free abstractions presented earlier, a 
software architecture for PDE solvers can be obtained which is robust 
with respect to different choices of coordinates and discretization techniques. 

Figure 3 A mathematical model based on partial derivatives makes the equations 
coordinate dependent. The metric of the manifold is in this case intermingled with 
the equation formulation. Note that (2.2) is the same equation as (3.2), which is why 
WaveEq3 reuses Eq2.2. 



Table 1 Different continuous abstractions lead to different discrete abstractions. 

Continuous                                   Discrete 

$u = u(x)$                                   $u_{ij} = u(x_{ij})$ 

$u = \sum_{i=1}^{\infty} c_i \phi_i(x)$      $u = \sum_{i=1}^{N} c_i \phi_i(x)$ 



3.1. DISCRETIZATION 

When discretizing, it is important that the discrete model represents 
the continuous model. Several choices have to be made, and we exemplify 
here that different continuous models yield different discrete models. 

Consider a scalar field $u = u(x)$, $x \in M \subset \mathbb{R}^2$. Of course, we can 
discretize the manifold, thus obtaining a discrete scalar field $u_{ij} = u(x_{ij})$. 
Another continuous model of a scalar field is to treat it as an infinite 
linear combination of basis functions: $u = \sum_{i=1}^{\infty} c_i \phi_i$. By terminating 
the infinite series, see Table 1, we discretize using a completely different 
mechanism. The two different mechanisms may be observed to correspond 
to rounding (discretizing the underlying manifold) and truncation 
(terminating an infinite series). 
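
The two discretization mechanisms can be contrasted in a short Python sketch. It is a one-dimensional illustration only: the manifold is taken to be the interval [0, 1], and the coefficients $c_i = 2^{-i}$ and basis functions $\phi_i(x) = \sin(i \pi x)$ are arbitrary choices made for the example, as are the function names sample and truncate.

```python
import math


def sample(u, n):
    """Discretize the manifold [0, 1]: keep u only at n + 1 grid points."""
    return [u(i / n) for i in range(n + 1)]


def truncate(coeff, basis, n_terms):
    """Discretize the series: keep the first n_terms of sum_i c_i * phi_i(x)."""
    return lambda x: sum(coeff(i) * basis(i, x)
                         for i in range(1, n_terms + 1))


# Example series u(x) = sum_{i >= 1} 2^(-i) * sin(i * pi * x).
coeff = lambda i: 2.0 ** -i
basis = lambda i, x: math.sin(i * math.pi * x)

u_10 = truncate(coeff, basis, 10)   # truncation after 10 terms
u_20 = truncate(coeff, basis, 20)   # truncation after 20 terms

# Sampling a (truncated) field on a grid with 5 points.
grid_values = sample(u_20, 4)
```

The result of truncate is still defined in every point of the interval, whereas the result of sample is a finite list of values. The two discrete objects therefore call for different interfaces, although both approximate the same continuous scalar field.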

Most PDE solvers discretize scalar fields and tensor fields by “lifting” 
a discretization of the manifold. Many projects, including Sophus [10], 
provide both serial and parallel versions of the manifold. In this respect, 
the manifold has proven to be a reusable and flexible software module. 
However, it is not easy to mix manifold modules from different projects. 

Besides discretizing the manifold, discretization of derivatives must 
also be addressed. According to our results from the previous section, 
PD scalar field is the proper module for computing derivatives. With a 
proper interface, we may easily provide different implementations. For 
example, Sophus provides both standard finite differences and finite dif- 
ferences for a staggered grid, see Figure 4. 
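
A minimal Python sketch of such an interface with two implementations is given below. It is not the Sophus code: the fields are one-dimensional and periodic, the class names merely echo those of Figure 4, and the staggered variant is simplified to a difference between neighbouring nodes whose result is understood to live at the midpoints.

```python
import math
from abc import ABC, abstractmethod


class PDScalarField(ABC):
    """Discrete scalar field on a periodic 1D grid with spacing h (illustrative)."""

    def __init__(self, values, h):
        self.values = list(values)
        self.h = h

    def __add__(self, other):
        # Arithmetic is lifted pointwise, independently of the stencil.
        return type(self)([a + b for a, b in zip(self.values, other.values)],
                          self.h)

    @abstractmethod
    def ddx(self):
        """Return the partial derivative as a new field of the same kind."""


class ScalarFieldFD(PDScalarField):
    def ddx(self):
        # Standard central difference at the grid nodes.
        v, n = self.values, len(self.values)
        return ScalarFieldFD(
            [(v[(i + 1) % n] - v[i - 1]) / (2 * self.h) for i in range(n)],
            self.h)


class ScalarFieldFDStag(PDScalarField):
    def ddx(self):
        # Difference of neighbouring nodes; the result approximates the
        # derivative at the midpoints between nodes (staggered positions).
        v, n = self.values, len(self.values)
        return ScalarFieldFDStag(
            [(v[(i + 1) % n] - v[i]) / self.h for i in range(n)],
            self.h)


def max_slope(field: PDScalarField) -> float:
    """Client code written against the interface only."""
    return max(abs(x) for x in field.ddx().values)


n = 64
h = 1.0 / n
samples = [math.sin(2 * math.pi * i * h) for i in range(n)]
```

The function max_slope accepts either implementation, so the choice of stencil can be changed without touching the code that uses derivatives. For the samples above, both variants return a value close to $2\pi$.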

3.2. TENSOR FIELD ABSTRACTIONS 

Compared with PD scalar fields, tensor fields are on a higher level 
of abstraction. Developing a good tensor module is a challenging task. 
Still, in one respect the tensor field is “easier” to implement, since it can 
be developed without explicit reference to the discretization. Achieving 
this goal requires some care, as discussed in this section. 

Considering Figure 1 again, we note that there are two 
seemingly equivalent ways to represent tensors, either as an aggregate 
of scalar fields or as an aggregate of tensors. As a concrete example, we 
consider a vector field on a two-dimensional manifold: 

$$u(x, y) = (u_1(x, y), u_2(x, y)).$$

It can be regarded as an “array” with three indices. The first is discrete 
and picks out the component, and the other two “indices” are 
the continuous coordinates. A discrete representation, based on Cartesian 
coordinates, would also give a three-dimensional array structure: 
$u_{kij} = u_k(x_i, y_j)$. It seems as if we have two equivalent ways to construct 
tensor field data: 

ManifoldFcn<Tensor<Scalar>>, (4) 

and 

Tensor<ManifoldFcn<Scalar>>. (5) 

In other words, we can either construct data for a Tensor Field as a 
ManifoldFcn over a Tensor or as a Tensor over a ManifoldFcn. 

To spot the difference, we must take derivatives into account. If construct 
(4) were used, all differentiation on the tensor field would have 
to be developed from scratch. This is related to the fact that two tensors 
in different points belong to different tensor spaces. The consequence is 
that if we change the underlying discretization of partial derivatives, all 
code concerning derivatives must be changed in the tensor module. 

[Figure 4: UML diagram. The type TensorField is implemented by the classes 
TensorFieldCart and TensorFieldCyl; the type PDScalarField is implemented by the 
classes ScalarFieldFD and ScalarFieldFDStag.] 

Figure 4 Different numerical discretizations may be supported with different scalar 
field implementations. Similarly, efficient tensor implementations for common coordinate 
systems may be provided. 



The second construct is more robust. We introduce PD scalar field in 
place of ManifoldFcn. A Tensor<PDScalarField> needs to know the 
metric of the manifold in order to compute the coordinate-free deriva- 
tives, but the numerical discretizations are hidden from the tensor field 
abstraction. Division of responsibilities is crucial for modular software. 
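
The division of responsibilities can be sketched as follows in Python. The tensor-level class knows the metric (here fixed to Cartesian coordinates) and combines component derivatives, while the stencil is chosen entirely inside the scalar field classes. The names FuncField, CentralField, ForwardField and CartesianVectorField are hypothetical, and the scalar fields are backed by callables purely to keep the example short.

```python
class FuncField:
    """PD scalar field backed by a callable; subclasses choose the stencil."""

    h = 1e-4  # step size used by the illustrative stencils

    def __init__(self, f):
        self.f = f

    def __call__(self, p):
        return self.f(p)

    def __add__(self, other):
        return type(self)(lambda p: self(p) + other(p))

    @staticmethod
    def _shift(p, axis, d):
        q = list(p)
        q[axis] += d
        return tuple(q)


class CentralField(FuncField):
    def ddx(self, axis):
        h = self.h
        return CentralField(
            lambda p: (self(self._shift(p, axis, h))
                       - self(self._shift(p, axis, -h))) / (2 * h))


class ForwardField(FuncField):
    def ddx(self, axis):
        h = self.h
        return ForwardField(
            lambda p: (self(self._shift(p, axis, h)) - self(p)) / h)


class CartesianVectorField:
    """Order-1 tensor over PD scalar fields; only the metric is known here."""

    def __init__(self, components):
        self.components = list(components)

    def div(self):
        # Cartesian metric: div u = sum_i d(u_i)/dx_i. No stencil appears here.
        terms = [c.ddx(i) for i, c in enumerate(self.components)]
        total = terms[0]
        for t in terms[1:]:
            total = total + t
        return total


def make_field(kind):
    # u = (x^2, x*y), so that div u = 3x.
    return CartesianVectorField([kind(lambda p: p[0] ** 2),
                                 kind(lambda p: p[0] * p[1])])


central = make_field(CentralField).div()((1.0, 2.0))
forward = make_field(ForwardField).div()((1.0, 2.0))
```

Switching from CentralField to ForwardField changes the accuracy of the result but requires no change in CartesianVectorField, which is the robustness argued for above.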

In practice, the computations can often be simplified if 
the metric is given. It may therefore be interesting to implement special 
purpose tensors for commonly used coordinate systems, see Figure 4. 
The discrete differentiation takes place only within the discrete PD 
scalar fields. The tensor abstraction involves components, which are 
inherent in the continuous domain. 

4. CONCLUDING REMARKS 

The role of mathematical abstractions is discussed, and we argue that 
the correct treatment of these is important in order to design modular 
software architectures for scientific computing. 

In particular, we investigate the role of mathematical abstractions in 
the context of simulating partial differential equations, PDEs. By emphasizing 
software engineering practices already when analyzing the 
mathematical domain, we identify several levels of mathematical abstractions 
and subdomains. Obtaining a robust mathematical model is 
the first step towards a sound software architecture. 

These ideas have been used for designing Sophus, a PDE solver library 
built upon various mathematical abstractions. Particularly, we point 
out the separation of concerns between the tensor field module and the 
partial derivatives module. The former belongs to the coordinate-free 
layer and is the module where the metric of the manifold is used. The 
latter belongs to the continuous layer and is the module where numerical 
discretization of derivatives is performed. This separation of concerns is 
the key for achieving high modularity. 

We find mathematical abstractions useful for the design of flexible 
PDE solvers. It would also be of interest to investigate the role of mathematical 
abstractions in other domains of scientific computing, for example 
visualization, optimization, or statistics. 

DISCUSSION 

Speaker: Krister Ahlander 

Robert van de Geijn : One interpretation of your talk is that a 
code implementing some component of a scientific computation should 
be more than an implementation of the solution method. It should be 
designed, structured and implemented to encode domain knowledge. 

Krister Ahlander : I agree with this interpretation. Perhaps an additional 
point of the talk is that already the choice of mathematical 
abstractions influences the structure and design of the software domain. 

Scott Kohn : At what point do concrete implementation details enter 
the abstraction? This is a very difficult question in scientific computing. 
I’m not sure that anyone has the answer. 

Krister Ahlander : At the application level, we avoid implementation 
details. Concrete implementation details are coded in lower level modules: 
tensors, scalar fields, meshes. These are chosen at link time. As a 
remark, we think that compile time polymorphism is important in order 
to achieve high performance. 

Reagan Moore : How does one guarantee numerical accuracy when 
the manifold, coordinate system, and physical parameters required for 
solving a differential equation must be simultaneously discretized? 

Krister Ahlander : The purpose of Sophus is to separate different 
discretizations in different modules. We do not guarantee that a chosen 
module configuration is appropriate from a numerical point of view, but 
we provide the possibility to freely choose among different configurations. 

John R. Rice : A comment on previous question and answer: The 
difficulties here are illustrated by a simulation of the behavior of a small 
fluid droplet. The droplet surface is changing and its smoothness has a 
strong effect on behavior because of surface tension effects. To obtain 
physically correct results one must start with a mathematical representa- 
tion of smooth surfaces that change smoothly. People who have started 
with a discretization of space into a grid or mesh have failed because 
their constructions introduced small deformations in the surface which 
are quickly magnified by surface tension. The mathematical representa- 
tion of smooth manifolds which can vary easily is not a simple or well 
known technology, but it can be done. A multiple droplet environment 
where surfaces merge or divide is even more challenging both mathe- 
matically and in the physics. Nevertheless, it is essential to start with a 
high level abstraction of the drop surfaces. 

Abstract This is a status report of a long-term research effort focusing on object-oriented 
modeling of parallel PDE solvers, based on finite difference 
methods on composite, structured grids. Two previous results of this 
effort are reviewed, the class libraries Cogito and Compose. Cogito is 
implemented in Fortran 90, with MPI for the message passing, and 
provides abstract data types for parallel composite-grid methods. Com- 
pose is in C++ and allows for fully object-oriented construction of PDE 
solvers by composition of objects. The object model behind Compose 
is described, and some research issues related to the refinement of the 
model are outlined. Finally, some recent results are presented, which 
are initial steps in addressing these issues. 

1. INTRODUCTION 

The traditional programming style in scientific computing is procedu- 
ral, plain Fortran. This leads to very efficient programs, but the process 
of constructing the programs is time-consuming and error-prone. The 
latter drawback has become increasingly accentuated as both the com- 
puter architectures and the scientific applications have become more 
complex. In fact, the lack of adequate software tools has been regarded 
as a serious impediment to the realization of the strategic potential of 
high performance scientific computing [10]. Consequently, during the 
last decade there has been a large number of research projects focusing 
on software issues in this field. 

Figure 1 Simulation of airflow through an expansion pipe. 

Many of the most successful projects have used an object-oriented 
approach. By now, the advantages of object-oriented scientific programming 
are widely recognized. The feasibility of the approach was demonstrated 
in early projects, which consisted in enriching the programming 
language with abstract data types (ADTs) suitable for scientific com- 
puting, see, e.g., [16], [26]. The overall programming style was still 
procedural. In a second phase came projects aiming for a fully object- 
oriented approach, where the main program essentially vanishes, see, 
e.g., [4, and the references therein], and [6]. The role of the main pro- 
gram reduces to creating objects and eventually activating one object, 
which subsequently activates other objects. The actual algorithm con- 
sists in interactions between objects. 

Our research group has been part of this development. We focus 
on the numerical solution of partial differential equations (PDEs). The 
major goals are: 

1 to construct complete PDE solvers via composition of objects; 

2 that this should not be restricted to predefined numerical opera- 
tors, i.e., that new numerical methods could be implemented via 
composition of objects; 

3 that this way of constructing PDE solvers should be applicable to 
scientific and industrial problems on parallel computer platforms. 

Primarily we consider finite difference methods. In PDE solvers based 
on finite difference approximations, complicated geometries are handled 
via the insertion of composite, structured grids, for example multiblock 
grids. Such a grid is a collection of structured grids, the union of which 
covers the geometry at hand. To represent a grid function on the entire 
composite grid, the grid functions on the different element grids are tied 
together through interpolation at the grid boundaries. In a parallel com- 
puting context, this interpolation can lead to communication between 
processors. 

Fig. 1 shows a simplified model problem to illustrate this. It is a 
simulation of airflow through an expansion pipe, and the geometry requires 
a five-block grid. The simulation is based on the compressible 
Navier-Stokes equations. In the following, this application will be used 
for illustration. 

The present paper gives a status report of our work. We begin by 
reviewing the results of our previous software projects Cogito [25] and 
Compose [2]. Both of them are class libraries, to be used for constructing 
PDE solvers. Cogito represents the ADT approach, whereas Compose is 
fully object-oriented (in the sense discussed above). Our current research 
aims at elaborating the object model of Compose. We discuss some 
research issues in that context, and present some recent results. 

2. REVIEW OF PREVIOUS RESULTS 

2.1. COGITO: ABSTRACT DATA TYPES 
FOR COMPOSITE GRIDS 

For the problems we are considering, with complicated data struc- 
tures, and the additional aspect of parallelization, the traditional way 
of constructing a PDE solver (from scratch, in Fortran) is inadequate. 
There is an apparent need for software tools that raise the level of ab- 
straction considerably. 

To this end, we developed Cogito [25], a Fortran 90 library supporting 
implementation of parallel PDE solvers. Cogito has an object-oriented 
design. The core classes are Grid, Composite Grid, and Grid Function. 
Each individual instance of these classes is automatically distributed over 
several processors. The parallelism is of SPMD type, and uses message 
passing via MPI. The message passing takes place within the object and 
is invisible to the user. 

We note in passing that although Fortran 90 is not an object-oriented 
language, it allows for an object-oriented style of implementation, via its 
mechanisms for modularization, data abstraction, and dynamic memory 
allocation. This has been noted by several authors, and [8] gives an 
exposition of “object-oriented” Fortran 90 programming. 

The data partitioning in Cogito is handled by a data partitioning module 
with an object-oriented design [22]. It is based on our own framework 
for partitioning of composite grids [23]. Within this framework, 
it is possible to implement a wide range of specific partitioning algorithms. 
Fig. 2 shows two examples of partitionings for the multiblock 
grid of the model problem discussed above. To the left is a straightforward 
approach, where each block is partitioned into rectangles, one per 
processor. Thus, each processor will get one part of each block. The 
alternative shown to the right in Fig. 2 is the result of a more sophisticated 
approach, which combines structured and unstructured partitioning 
techniques, giving one connected subdomain to each processor, 
thus reducing the communication. Both of these partitionings, as well 
as many others, can be expressed in the partitioning framework which 
is contained in Cogito. 

Figure 2 Example of two ways of partitioning and mapping the multiblock grid for 
the model problem in Fig. 1. Both of these alternatives, as well as many others, can 
be expressed in the partitioning framework contained in Cogito. 

    ! Initiation: composite grid read from file, partitioned into rectangles 
    call Create_CG(cg, 'pipe.dat', 'rectangle') 
    ! Grid function u with 4 components per grid point 
    call Create_GF(u, 'u', 4, cg) 

    ! Runge-Kutta time marching 
    do s = 1, nstages 
       call spdisc(v, R)               ! R = discretized right-hand side at v 
       call Saxpy_GF(v, a(s), R, u)    ! v = u + a(s)*R 
       call Interpolate_GF(v, 'pre') 
       call bound(v) 
       call Interpolate_GF(v, 'post') 
    end do 

Figure 3 Code example with calls to Cogito. For explanations, see the text. The 
overall style of programming is procedural. The program is parallel. The user specifies 
what strategy to use for partitioning and distributing the grid. Apart from this, all 
the parallelism is hidden to the user. 

The present classes in Cogito are essentially data stores with asso- 
ciated operations. They do not initiate interaction with other objects. 
Thus, the overall style of programming is procedural. In a PDE solver 
based on Cogito, the numerical method is expressed as operations on grid 
function objects. Fig. 3 shows, as an example, the initiation of some ob- 
jects, and a subsequent code sequence expressing the Runge-Kutta time 
marching scheme used in our 2D compressible Navier-Stokes solver. We 
use a naming convention, such that the Fortran routine implementing the 
operation B on class A gets the name B_A. (In order to avoid extremely 
long names, we use standardized abbreviations of the class names.) 

It is assumed that the grid has been generated by a separate grid 
generator, and is stored in a file. In Fig. 3, Create_CG initiates a composite 
grid object, reading the grid data from the file pipe.dat. The 
grid is to be partitioned into rectangles (see Fig. 2, left-hand side) and 
these are distributed over the processors. Next, Create_GF creates a 
grid function u on the composite grid. This grid function will represent 
the numerical solution of the 2D compressible Navier-Stokes equations, 
and consequently it has four components per grid point. The grid func- 
tion is automatically distributed in the same way as the corresponding 
grid. Subsequent operations on the grid function are carried out on the 
entire grid, and the message passing is hidden from the user. 

The second code sequence in Fig. 3 implements a Runge-Kutta time 
marching scheme, which computes a number of Runge-Kutta stages 
$v^{(s)} = u^n + a(s) R(v^{(s-1)})$. Here, R is the discretized right-hand side of 
the compressible Navier-Stokes equations. The discretized space derivatives 
are computed inside the subroutine spdisc, which is supplied by 
the user. This subroutine, as well, is implemented via calls to operations 
on Grid Function objects. The Saxpy_GF operation computes 
u + a(s)*R, where u and R are grid functions, and the result is stored 
in the grid function v. 
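
The structure of the stage loop can be mirrored in a few lines of Python. This is a serial sketch, not Cogito: grid functions are plain lists, the interpolation and boundary calls of Fig. 3 are omitted because there is only one grid, and it is assumed that the coefficients a(s) already contain the time step and that v is initialised to u before the first stage. The names saxpy, runge_kutta_step and decay are hypothetical.

```python
def saxpy(alpha, x, y):
    """Return alpha * x + y element-wise (the role played by Saxpy_GF)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def runge_kutta_step(u, a, spdisc):
    """One time step: v(s) = u + a(s) * R(v(s-1)) for each stage s."""
    v = list(u)                 # assumed starting value: v(0) = u
    for a_s in a:
        r = spdisc(v)           # user-supplied discretized right-hand side
        v = saxpy(a_s, r, u)    # v = u + a_s * r
    return v


# Toy right-hand side du/dt = -u in place of the Navier-Stokes operator.
decay = lambda v: [-vi for vi in v]

dt = 0.1
u_new = runge_kutta_step([1.0], [dt / 2, dt], decay)
```

As in the Fortran version, the time marching code is independent of the right-hand side; only the function passed as spdisc is specific to the problem and its discrete approximation.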

2.2. COMPOSE: OBJECT-ORIENTED 
COMPOSITION OF PDE SOLVERS 

Writing a PDE solver based on Cogito relieves the programmer of a 
considerable amount of low level details, concerning the data structures 
and the parallelization. However, the numerical method still has to 
be hand coded, as a sequence of operations on grid function objects. 
Moreover, the coupling between the numerical method and the PDE 
problem remains, which leads to limited reusability of the code. In the 
code example above (Fig. 3), the Runge-Kutta code segment is reusable, 
but the user-supplied subroutine spdisc is specific for a certain discrete 
approximation of a certain PDE problem. Thus, if we wish to address 
the same equations with a different approximation, or apply the same 
approximation to another set of equations, the subroutine spdisc has 
to be rewritten from scratch. 

In order to increase the reusability of the code, we developed Compose 
[2]. The goal was to allow for “component-based” construction of PDE 
solvers. The approach remained object-oriented, so the “components” 
were to be objects. 

The object-oriented framework Compose is implemented in C++ and
contains classes representing the mathematical equations, boundary
conditions, etc., as well as various aspects of the numerical methods.
Such objects can be composed into a complete PDE solver, which thus
becomes fully object-oriented in the sense discussed above.

Figure 4 The key classes of Compose.

The Compose project emphasized the object-oriented analysis, and 
the resulting object model was a main result of the project. This model 
can be regarded as a general framework for the construction of PDE 
solvers. The key classes on the uppermost level of abstraction are shown 
in Fig. 4. On this level of abstraction, the model applies to a variety of 
PDE solvers, with different underlying numerical approaches. Our im- 
plementation of the model, however, focusses on finite difference methods 
on composite, structured grids. 

The Compose object model distinguishes between the PDE Problem 
and the PDE Solver. There is an association between the two, reflect- 
ing that a PDE solver is associated to the problem it is to solve. The 
PDE Problem is an aggregate of Equations, and the PDE Solver is an 
aggregate of System Solvers, each of which is an aggregate of Equation 
Discretizers. The Equation Discretizer is associated to an Equation. 

Fig. 4 does not show the lower level classes Grid, Grid Function, etc.
However, such classes are present in Compose as well, and provide a
supporting lower layer. In our implementation of the Compose object
model, Overture [5] serves this purpose. Overture is a C++ library
similar to Cogito. In Overture, the discrete grid function has
differential operators. For example, u.x() represents the differentiation of u
with respect to x. This notation is used to express the PDE problem. 
However, the actual computation of the differential operators is done via 
discrete approximations. Overture has a class that represents a package
of discrete space operators. When the grid function is initiated, it
gets associated to such a package, which then specifies which discrete
approximations to use for the various derivatives.

We now explain the dynamic behavior of the Compose object model 
for the case of an explicit time marching method, as in the example of 
our 2D compressible Navier-Stokes solver above. In Compose, the com- 
pressible Navier-Stokes equations would be an inheritor of the Equation 
class. The PDE Solver would have an Explicit System Solver (an inheri- 
tor of System Solver). The space operator package object is an argument 
to the System Solver constructor. Thus, the Explicit System Solver will 
be able to establish the connection between the solution (grid function) 
and the discrete space operators. When the Explicit System Solver is 
to advance the solution to the next time level, it tells the Runge-Kutta 
Discretizer to update the solution. This particular equation discretizer, 
an inheritor of the base class Equation Discretizer, knows what time 
marching algorithm to use, but it tells the equation object to compute 
the right-hand side (cf. R in Fig. 3). The equation object then applies 
the various differential operators (and other operators) occurring in its 
right-hand side. The solution (an instance of Grid Function), via its 
associated space operator package, knows what discrete approximations 
to use for these derivatives. 
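
The chain of delegation described above can be sketched in a few lines. The sketch is in Python rather than the C++ of Compose, and it is only an illustration under stated assumptions: the class names follow the text, but every method name, the periodic centered-difference package, the 1D advection equation, and the stage coefficients are hypothetical choices, not the Compose or Overture interfaces.

```python
class CenteredDifferences:
    """Hypothetical space operator package: periodic centered differences."""

    def __init__(self, h):
        self.h = h

    def x(self, values):
        n = len(values)
        # values[i - 1] wraps around for i = 0, giving periodic boundaries.
        return [(values[(i + 1) % n] - values[i - 1]) / (2.0 * self.h)
                for i in range(n)]


class GridFunction:
    """Solution values tied to the operator package chosen at creation."""

    def __init__(self, values, operators):
        self.values = list(values)
        self.operators = operators

    def x(self):
        # The grid function delegates to its associated operator package.
        return GridFunction(self.operators.x(self.values), self.operators)


class Equation:
    def rhs(self, u):
        raise NotImplementedError


class AdvectionEquation(Equation):
    """u_t = -c u_x, stated without reference to any discretization."""

    def __init__(self, c):
        self.c = c

    def rhs(self, u):
        return [-self.c * d for d in u.x().values]


class EquationDiscretizer:
    def update(self, u, dt):
        raise NotImplementedError


class RungeKuttaDiscretizer(EquationDiscretizer):
    """Knows the time marching algorithm; asks the equation for R."""

    def __init__(self, equation, coeffs=(0.25, 1.0 / 3.0, 0.5, 1.0)):
        self.equation = equation
        self.coeffs = coeffs

    def update(self, u, dt):
        v = GridFunction(u.values, u.operators)
        for a in self.coeffs:
            r = self.equation.rhs(v)
            v = GridFunction([ui + a * dt * ri for ui, ri in zip(u.values, r)],
                             u.operators)
        return v


class ExplicitSystemSolver:
    """Receives the operator package and connects it to the solution."""

    def __init__(self, operators, discretizers):
        self.operators = operators
        self.discretizers = discretizers

    def advance(self, values, dt):
        u = GridFunction(values, self.operators)
        for discretizer in self.discretizers:
            u = discretizer.update(u, dt)
        return u.values


# Usage: equation and solver are separate objects that are composed here.
solver = ExplicitSystemSolver(CenteredDifferences(0.1),
                              [RungeKuttaDiscretizer(AdvectionEquation(1.0))])
constant_state = solver.advance([2.0] * 8, 0.01)
```

Swapping `CenteredDifferences` for another package, or `AdvectionEquation` for another equation, leaves the remaining classes untouched, which is the kind of reuse the object model is aiming for.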

In the case of an implicit time marching method, the dynamic be- 
havior is similar, but more complicated. Then, the Equation Discretizer 
contributes to the construction of an algebraic system. The various 
contributions from different equation discretizers are assembled by the 
Implicit System Solver, which subsequently solves the system. 

It should be noted that each of the classes Equation, System Solver, 
and Equation Discretizer is the abstract base class of an inheritance 
hierarchy. As an example, the Equation hierarchy has a first level of 
inheritors representing various kinds of equations. The actual equations 
are on the second level of the inheritance tree. Not only the partial 
differential equations are represented as equation objects, but also the 
boundary and initial conditions. 

The Compose model has been demonstrated to work in practice, for 
example in the case of the 2D incompressible Navier-Stokes equations [1]. 
In the Compose-based implementation of a solver for these equations, the 
PDEs (a system of two convection-diffusion equations for the velocities, 
and an elliptic equation for the pressure) are represented as independent 
objects, as are the initial conditions and various boundary conditions 
needed. Moreover, the elliptic equation for the pressure could reuse an 
existing class for the Poisson equation, and the solver for that equation. 
However, since the solver objects are separate from the equation objects, 
we could as well reuse the equation only, and connect it to a new solver.
This distinguishes Compose from Diffpack [4, Chapter 11], which also
has a fully object-oriented structure, but does not handle the equations 
as independent objects. Another software library with similar scope, 
ELEMD [4, Chapter 4], has separated the equations, and also includes a 
class Equation Discretizer. There are differences in details, mainly due 
to the fact that ELEMD emphasizes finite element methods, whereas 
Compose focusses on finite difference methods. 

Another related effort is POOMA [6]. There is no apparent coun- 
terpart to Compose in the five-layer model of POOMA. The top layer, 
the application layer, differs between applications, and there is no gen- 
eral model for composing applications, whereas this is precisely where 
Compose has its focus. 

Finally, Compose has built-in support for code validation and
monitoring. The equation classes include operations that allow for testing
the equations with known solutions (via the technique of forcing) [2].
This, in turn, is based on support in Overture for this kind of testing.
Moreover, Compose includes the concept of Monitor classes, which can
be used for computational steering [2]. Neither of these features seems
to be available in the related work discussed above.
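
The forcing technique can be illustrated on a model problem. The sketch below (Python, standard library only) is not Compose or Overture code. It assumes the periodic advection equation u_t = -c u_x, picks the known function w(x, t) = cos(t) sin(x), adds the forcing f = w_t + c w_x so that w becomes an exact solution, and measures how far the computed solution is from w. The classical fourth-order Runge-Kutta scheme and all function names are choices made for this example.

```python
import math


def ddx(values, h):
    """Periodic centered difference approximation of the x-derivative."""
    n = len(values)
    return [(values[(i + 1) % n] - values[i - 1]) / (2.0 * h) for i in range(n)]


def exact(x, t):
    """The known solution w(x, t) chosen for the test."""
    return math.cos(t) * math.sin(x)


def forcing(x, t, c):
    """f = w_t + c*w_x, so that w solves u_t = -c*u_x + f exactly."""
    return -math.sin(t) * math.sin(x) + c * math.cos(t) * math.cos(x)


def rhs(u, t, xs, h, c):
    return [-c * d + forcing(x, t, c) for d, x in zip(ddx(u, h), xs)]


def rk4_step(u, t, dt, xs, h, c):
    """One classical Runge-Kutta step with a time-dependent right-hand side."""
    k1 = rhs(u, t, xs, h, c)
    k2 = rhs([ui + 0.5 * dt * k for ui, k in zip(u, k1)], t + 0.5 * dt, xs, h, c)
    k3 = rhs([ui + 0.5 * dt * k for ui, k in zip(u, k2)], t + 0.5 * dt, xs, h, c)
    k4 = rhs([ui + dt * k for ui, k in zip(u, k3)], t + dt, xs, h, c)
    return [ui + dt / 6.0 * (a + 2.0 * b + 2.0 * g + d)
            for ui, a, b, g, d in zip(u, k1, k2, k3, k4)]


def forced_error(n=64, c=1.0, dt=0.01, steps=50):
    """Maximum deviation from the known solution after the given steps."""
    h = 2.0 * math.pi / n
    xs = [i * h for i in range(n)]
    u = [exact(x, 0.0) for x in xs]
    for k in range(steps):
        u = rk4_step(u, k * dt, dt, xs, h, c)
    t_end = steps * dt
    return max(abs(ui - exact(x, t_end)) for ui, x in zip(u, xs))


# Usage: the error should shrink when the grid is refined.
coarse_error = forced_error(n=64)
fine_error = forced_error(n=128)
```

Because the exact solution is known, a coding error in the space discretization would show up as an error that fails to decrease under grid refinement.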

3. FINE-GRAINED MODELING OF 
NUMERICAL OPERATORS 

3.1. OVERVIEW 

With respect to the goals stated in § 1, the prototype implementations 
of Compose and Cogito address the first and third goals. In our ongoing 
work we are exploring different directions for extending and improving 
the Compose model in view of the second goal. That is, the aim is to 
make it possible to construct numerical operators incrementally, starting 
out with a set of basic objects. In this way, the user will be able to design 
new numerical algorithms within the object-oriented framework, without
having to fall back on “low level” programming in C++ or Fortran 90.

Primarily, we are focusing on the following types of operators. 

■ Stencils. We have recently equipped Cogito with a Stencil class, 
which allows for incremental construction of stencil operators. 

■ Coordinate invariant operators. In the same spirit as a group at 
Bergen University [3], with which we are interacting, we aim at 
distinguishing between the mathematical formulation of the differ- 
ential equations and the actual evaluation in a specific coordinate 
system. Since the equations occurring in applications can often 
be expressed in terms of coordinate invariant operators (gradient,
divergence, curl, etc.), software support for such operators would
increase the reusability of the software. As an example, a prob- 
lem with a cylindrical geometry may then be simplified to a 2D 
problem without changing the equations. 

The support for coordinate invariant operators is also appealing 
in the context of curvilinear structured grids, where the actual 
computations take place on a rectangular grid. Then, there is by 
necessity a coordinate transformation involved, between the com- 
putational grid and the physical grid. This calls for a reformu- 
lation of the PDE, involving the metric tensor [7, p. 68] of the 
mapping. However, if the equations are expressed in terms of co- 
ordinate invariant operators, and if such operators are available in 
the object-oriented software library, then the reformulation of the 
equations can be avoided. 

The expression for a coordinate invariant operator in a specific 
coordinate system includes derivatives of various quantities. For 
a user who wishes to fine-tune the algorithm, it is desirable to 
be able to decide precisely how these derivatives are going to be 
approximated in the discretized operator. Consequently, we want 
to allow for composition of coordinate invariant operators, using 
basic building blocks such as difference stencils. 

It can be noted that Overture [5], as well, addresses the issue of
automating the mapping between the computational and physical
coordinate systems in the case of curvilinear grids. However, it does
not support coordinate invariant operators in the way discussed above.
Neither does it allow for “component-based” design of operators, as
envisaged here.

■ General difference operators. A natural extension of these ideas
would be to allow for incremental construction of general differ- 
ence operators. For example, the complete right-hand side of a 
difference method for a nonlinear PDE could be expressed as a 
single difference operator. 

■ Preconditioners. The idea of fine-grained modeling also carries
over to other types of operators. Preconditioners are an example.
So far preconditioners have been regarded as atomic units in ob- 
ject modeling. In a pilot study [18], we went beyond that limit, in 
that we presented a way of constructing a certain family of pre- 
conditioners from “smaller” objects. We are currently generalizing 
these ideas.

Remark: The fine-grained decoupling of operators that we are aiming
for is motivated by the needs of algorithm developers. Many users do not 
require this flexibility, but are satisfied with using a standard variant of 
an operator. For them, we envisage that a library of predefined operator 
objects will be available. 

In the following, we discuss in more detail two cases of fine-grained 
modeling of numerical operators: stencils and preconditioners. 

3.2. GENERALIZED STENCIL OPERATORS 

We have recently extended Cogito with stencil operators [19]. Objects 
of the class Stencil can be created as combinations of “smaller” objects 
of the same class. There is a library of basic objects, representing the 
identity operator, the shift operators in different space directions, and 
the standard difference approximations of the first derivative. By op- 
erations such as composition of stencils, multiplication of stencils with 
coefficients (scalar or matrix), etc., complex stencils can be built. This 
makes it possible for the user to design new stencil operations within the 
framework, and to store them in the library for subsequent reuse. 
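
Incremental construction of this kind can be sketched as follows. The sketch is a 1D, scalar illustration in Python, not the Fortran 90 Stencil class of Cogito: a stencil is stored as a mapping from grid offsets to coefficients, and the method names (`scale`, `compose`, `apply`) and the periodic application are assumptions made for the example.

```python
class Stencil:
    """A 1D stencil stored as {offset: coefficient}."""

    def __init__(self, coeffs):
        self.coeffs = {k: v for k, v in coeffs.items() if v != 0}

    def __add__(self, other):
        out = dict(self.coeffs)
        for k, v in other.coeffs.items():
            out[k] = out.get(k, 0.0) + v
        return Stencil(out)

    def scale(self, factor):
        return Stencil({k: factor * v for k, v in self.coeffs.items()})

    def compose(self, other):
        """Stencil equivalent to applying `other` first and then `self`."""
        out = {}
        for k1, v1 in self.coeffs.items():
            for k2, v2 in other.coeffs.items():
                out[k1 + k2] = out.get(k1 + k2, 0.0) + v1 * v2
        return Stencil(out)

    def apply(self, values):
        """Apply the stencil to a periodic sequence of values."""
        n = len(values)
        return [sum(c * values[(i + k) % n] for k, c in self.coeffs.items())
                for i in range(n)]


# Basic library objects: identity and the two shift operators.
IDENTITY = Stencil({0: 1.0})
SHIFT_RIGHT = Stencil({1: 1.0})    # picks up the value at i + 1
SHIFT_LEFT = Stencil({-1: 1.0})    # picks up the value at i - 1


def forward_difference(h):
    return (SHIFT_RIGHT + IDENTITY.scale(-1.0)).scale(1.0 / h)


def backward_difference(h):
    return (IDENTITY + SHIFT_LEFT.scale(-1.0)).scale(1.0 / h)


# Usage: build the standard second difference from first differences.
second_difference = forward_difference(0.5).compose(backward_difference(0.5))
samples = [0.25 * i * i for i in range(6)]      # u = x^2 sampled at x = 0.5*i
curvature = second_difference.apply(samples)    # interior entries equal 2
```

The composed object is again a single stencil, so applying it costs one sweep over the data rather than two.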

As an additional benefit, the ability to collect a large number of
arithmetic operations into a single stencil operation leads to improved
cache utilization. In our experiments, on a Sun Wildfire parallel
platform, we observed reductions of 50% in execution time when the new
stencil class was introduced [19]. This is explained partly by the cache
effects, partly by additional code restructuring that helped the
Fortran 90 compiler do a better job.

The stencils are “generalized” in the sense that a single stencil object 
can act on several components of a grid function, and can have different 
actions on different components. Moreover, the number of components
in the result can be different from the number of components of the 
operand. Thus, in general, we allow for stencils where the coefficients 
are rectangular matrices. 
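
A small sketch may make the role of rectangular coefficient matrices concrete. It is a Python illustration under assumptions, not Cogito code: each stencil coefficient is a p-by-m matrix that maps the m components at a neighboring grid point to p result components, and the function name and periodic indexing are invented for the example.

```python
def apply_matrix_stencil(coeffs, u):
    """Apply a stencil whose coefficients are p x m matrices.

    coeffs: {offset: matrix with p rows and m columns}
    u:      list of grid points, each a list of m components (periodic)
    Returns a list of grid points with p components each.
    """
    n = len(u)
    p = len(next(iter(coeffs.values())))
    result = []
    for i in range(n):
        point = [0.0] * p
        for offset, matrix in coeffs.items():
            source = u[(i + offset) % n]
            for row in range(p):
                point[row] += sum(matrix[row][col] * source[col]
                                  for col in range(len(source)))
        result.append(point)
    return result


# Usage: two components (a, b) in, one component out: D0(a) + b with h = 1,
# where D0 is the centered first difference.
coeffs = {
    -1: [[-0.5, 0.0]],
    0: [[0.0, 1.0]],
    1: [[0.5, 0.0]],
}
grid_function = [[0.0, 10.0], [1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
result = apply_matrix_stencil(coeffs, grid_function)
```

Here every coefficient is a 1-by-2 matrix, so the operand has two components per grid point while the result has one.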

When a stencil is going to be used, it is connected to a grid function. 
Internally, in the grid function object, the stencil is then stored in a 
table, and persistent MPI objects for the communication are set up. 
The actual application of the stencil takes place via subsequent calls 
to Apply Stencil, which is an operation of the class Grid Function. By 
locating the application inside the grid function object, the internal data 
structures of that object can be accessed directly, which is important for 
efficient execution of the stencil operation. 

The idea of a stencil class is not new. The point of our particular 
design is the support for incremental construction of stencils. POOMA
[13] has a stencil class that allows for efficient application of stencils, but
the design of new stencils requires C++ coding. Karpovich et al. [14] 
have no separate stencil representation, but provide classes for applying 
a sequence of stencils to a matrix. 

3.3. PRECONDITIONERS BASED ON FAST 
TRANSFORMS 

As a second example of fine-grained modeling of numerical operators, 
we discuss preconditioners based on orthogonal transforms, also known 
as normal block preconditioners [11], [20]. This is linked to other research 
activities at our department, where the aim is to design new normal block 
preconditioners for discretizations of systems of PDEs. By modeling 
the preconditioners as described in the following, we will provide these 
colleagues with a laboratory in which they can conveniently compose 
new preconditioners and experiment with them. The savings, in terms 
of human efforts, will be huge, and with careful implementation the 
additional overhead in execution time will be negligible. 

The construction of a normal block preconditioner for the discrete 
system Bu = g goes as follows: 

■ Select a set of discrete transforms, and decide which transform to 
apply in what space direction. 

■ Form a normal block operator M with the same block structure 
as B, but where the blocks at one or more levels of M are di- 
agonalizable by the transform matrices, and where M is the best 
approximation to B measured in the Frobenius norm [20].

The subsequent application of a normal block preconditioner can be effi- 
ciently implemented in parallel, in terms of fast transforms and solution 
of narrow-banded systems [15]. 
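
The two construction steps can be illustrated in the simplest setting: a single level, scalar unknowns, and one orthogonal transform Q. In that case the best Frobenius-norm approximation among matrices diagonalized by Q is M = Q diag(Q^T B Q) Q^T, which follows from the invariance of the Frobenius norm under orthogonal transformations. The Python sketch below uses the discrete sine transform and dense matrix products for clarity. It is an assumption-laden illustration, not the block tensor formulation of [20], and it uses no fast transform.

```python
import math


def sine_transform(n):
    """Orthonormal, symmetric discrete sine transform matrix (DST-I)."""
    scale = math.sqrt(2.0 / (n + 1))
    return [[scale * math.sin((j + 1) * (k + 1) * math.pi / (n + 1))
             for k in range(n)] for j in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def transformed_diagonal(b, q):
    """Diagonal entries of Q^T B Q."""
    t = matmul(matmul(transpose(q), b), q)
    return [t[i][i] for i in range(len(b))]


def best_approximation(b, q):
    """M = Q diag(Q^T B Q) Q^T."""
    lam = transformed_diagonal(b, q)
    n = len(b)
    d = [[lam[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(q, d), transpose(q))


def preconditioner_solve(b, q, g):
    """Solve M u = g as u = Q diag(1/lambda) Q^T g."""
    lam = transformed_diagonal(b, q)
    y = matmul(transpose(q), [[x] for x in g])
    y = [[y[i][0] / lam[i]] for i in range(len(g))]
    return [row[0] for row in matmul(q, y)]


# Usage: the second-difference matrix is diagonalized exactly by the sine
# transform, so its approximation M coincides with B itself.
n = 4
B = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
Q = sine_transform(n)
M = best_approximation(B, Q)
solution = preconditioner_solve(B, Q, [0.0, 0.0, 0.0, 5.0])
```

In a practical setting the products with Q would be replaced by fast transforms, which is what makes the preconditioner solve cheap.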

Traditionally, the system Bu = g is interpreted as a linear algebraic 
system with coefficient matrix B, and with u and g as vectors. However, 
since u and g are actually grid functions with four indices each — one 
for the components and one for each space direction — B is a tensor [7] 
with four upper indices (“row” indices) and four lower indices (“col- 
umn” indices). It is largely the underlying grid that determines the 
block structure of the tensor B. It would be obscured by a conversion 
to matrix form. Consequently, the construction and application of the 
normal block preconditioner is much more conveniently expressed if B 
is maintained in its original tensor form [20]. This is a motivation for 
introducing the tensor as a new basic data type in Cogito.

In our tentative object model for the construction of normal block
preconditioners, the Normal Block Solver constructs the preconditioner, 
and knows how to carry out subsequent preconditioner solve operations. 
The construction of the normal block operator M is based on the Band 
Tensor B, and on a Poly transform, which is an ordered set of Trans- 
form objects. The resulting preconditioner is also a band tensor, and 
is tightly connected to the normal block solver. Each transform is a 
discrete trigonometric transform, which is associated with the discrete 
Fourier transform, which is further associated to the radix-2 fast Fourier 
transform. 

We have recently made a serial pilot implementation of this model, 
in Fortran 90. It contains seven orthonormal and three nonorthonormal 
transforms. Moreover, it has operations such as applying a polytrans- 
form to a grid function, computing the inner product between a band 
tensor and a grid function, and applying the normal block solver. The 
serial code will be used for validation (and possibly modification) of 
the model, before we go on to a parallel implementation. Note that
these implementations do not begin from scratch, but can build on our 
previous experiences [12, 21, 15]. 

4. MIXED C++/F90 IMPLEMENTATION OF 
FLOW SOLVERS 

The revised object model we aim for is to be implemented on the basis
of our previous software Cogito (Fortran 90) and Compose (C++). The
intention is to develop a flexible framework for construction of parallel
PDE solvers using the object-oriented capabilities of C++, which will
execute with high parallel efficiency via the Fortran 90 components of
Cogito.

As a first step in this direction, we have made a C++ embedding of the
“object-based” Fortran 90/MPI version of Cogito [17]. We have
demonstrated for a scalar advection problem in 2D that the C++/Fortran 90
version gives almost exactly the same execution time as the pure
Fortran 90 code.

For further validation of the mixed-language approach, we reimplemented
the 2D compressible Navier-Stokes code by calling the new,
mixed-language version of Cogito. In this case as well, there was a
negligible difference in execution time between the pure Fortran 90 and
the C++/Fortran 90 version of the Navier-Stokes solver. In addition, we
measured scalability, in terms of sizeup [24], for the two versions of the
Navier-Stokes code. The two codes show (almost) identical behavior.
The maximum discrepancy is 4.5% and there is no trend of an increasing
discrepancy as the number of processors goes up [17].

Using calls to Fortran from within C++ classes is relatively
straightforward. Our situation is more complicated, since we wrap C++
around an object-based Fortran 90 code. This problem was discussed in
[9]. Our approach [17] is similar, but avoids one of the steps of
wrapping. Moreover, we have demonstrated that our approach works in
practice for a relatively large object-based Fortran 90 library, i.e.,
Cogito.

5. CONCLUSIONS 

In previous work, we have explored the potential of object-oriented
programming in the context of numerical solvers for partial differential
equations. First, we demonstrated that an object-oriented style of
implementation of parallel PDE solvers in Fortran 90 is feasible, and
relieves the programmer of many low-level details. Next, we proposed
the object-oriented framework Compose, which allows for fully
object-oriented construction of PDE solvers. In Compose, the PDEs are
represented as separate objects, independent of the numerical approach to
be used. A pilot implementation of Compose, in C++, shows that the
model is applicable to realistic application problems, such as the
incompressible Navier-Stokes equations on composite, structured grids.

We conclude that it is relevant to continue elaborating the Compose 
model. In particular, we have discussed the introduction of a more fine- 
grained modeling of numerical operators, so that complicated operators 
can be constructed via composition of simpler ones. As a preliminary 
result in this direction, we described a generalized stencil class that has 
recently been implemented. Moreover, we discussed the object-oriented 
modeling of normal block preconditioners. Finally, we presented the
ambition to base the next implementation of the C++ library Compose
on our “pseudo object-oriented” Fortran 90 library Cogito, which
executes efficiently on parallel platforms. The new mixed C++/Fortran 90
version of Cogito promises to be a suitable basis for this development.

DISCUSSION

Speaker: M. Thune 

Fred Gustavson: Did you use parallel narrow band solvers in your
implementation?

M. Thune: The pilot implementation of the Normal Block Solver and
related classes is serial, but of course we intend to carry on this work with
a parallel implementation. As I mentioned, Cogito is parallel. Moreover,
my co-author Kurt Otto and his colleagues have had for many years
a non-object-oriented implementation of parallel preconditioners of this
type.

Abstract: Scientific programs rely heavily on software libraries. This paper
describes the limitations of this reliance and shows how it degrades
software quality. We offer a solution that uses a compiler to automatically
optimize library implementations and the application programs that 
use them. Using examples from the PLAPACK parallel linear algebra 
library, we present our solution, which includes a simple declarative 
annotation language that describes certain aspects of a library’s imple- 
mentation. We also show how our approach can yield simpler scientific 
programs that are easier to understand, modify and maintain. 

Keywords: software libraries, optimization, meta-interfaces 

1. INTRODUCTION 

The goal of a software architecture is to promote code reuse and to 
allow programs to be easily maintained and modified. These goals are 
particularly difficult to achieve in the context of scientific computing,
which can be characterized by three properties: (1) efficient runtime per- 
formance and efficient memory usage are critical, (2) the practitioners of 
scientific computing are typically not schooled in software engineering, 
and (3) deep knowledge of the scientific domain is required. The first 
property tempts programmers to emphasize performance over clarity, 
which often complicates the long term maintenance and portability of 
scientific codes. The second property explains why scientific program- 
mers are typically unwilling to try novel languages or to use sophisticated 
design methodologies. In particular, it explains why scientific computing
relies so heavily on software libraries. The third property, the
requirement of deep domain knowledge, represents an underutilized
opportunity that we will attempt to exploit.

Figure 1 Architecture of the Broadway Compiler system

Software libraries offer several strengths. They do not require the user 
to learn new language syntax, they can raise the level of abstraction 
to support common operations, and they provide a simple means of 
reusing code. Thus, software libraries have become a de facto software 
architecture for scientific programming. Unfortunately, libraries place 
the burden of optimization on the library user and force optimizations 
to be implemented directly in the application’s source code. As this 
paper will illustrate, these manual optimizations adversely affect the 
application program by decreasing clarity, reusability, and portability, 
while increasing program complexity. 

This paper describes a method of automating the optimization of 
library implementations and the application programs that use them. 
This new approach allows applications to use simpler interfaces to exist- 
ing libraries, and it yields cleaner application programs that are easier to 
understand and maintain. Furthermore, our approach allows scientific 
programmers to continue using libraries in the manner to which
they have become accustomed. In essence, we are proposing a method
of transforming software libraries into a viable and effective software 
architecture. 

Figure 1 shows the overall architecture of our system. At the core 
is the Broadway compiler, which takes as input the application source 
code, the library source code, and a set of annotations that describe 
the library. The compiler produces as output an integrated, optimized 
library and application program. The annotation language is critical
because it conveys to the compiler domain-specific information that can 
be used in the optimization process. These annotations allow the Broad- 
way compiler to analyze and manipulate library operations in the same 
way that ordinary C compilers analyze and manipulate the primitives of 
the C language. 

This paper makes the following contributions.

■ We illustrate the long term maintenance and portability problems
caused by the use of libraries in high performance programs. 

■ We describe the Broadway annotation language as a meta interface 
and explain how it improves the maintenance and portability of 
applications that use libraries. 

The remainder of this paper is organized as follows. Section 2 ex- 
plains the weaknesses of using software libraries as an architecture for 
creating performance-critical applications. Section 3 then explains how 
performance optimizations are typically applied to traditional libraries, 
and Section 4 explains how our solution uses a meta interface to address 
the weaknesses of existing software libraries. Section 5 discusses the long 
term benefits of our solution and its meta interface. We distinguish our 
work from related work in Section 6 and conclude in Section 7. 

2. WEAKNESSES OF SOFTWARE 
LIBRARIES 

Software libraries lead to a number of closely related performance 
problems: 

1 Different clients have different needs. An implementation 
that is appropriate for one client can be inappropriate for another. 
Here we use the term “client” to refer to an application program 
that invokes library routines. 

2 “Separation of concerns” inhibits information flow across 
interfaces. The performance of a library can typically be im- 
proved if the implementor is made aware of the client’s needs. 

3 Worst case assumptions provide generality at the expense 
of performance. To provide correct behavior in all situations, li- 
braries make worst case assumptions, which can lead to excessive 
copying of data, excessive synchronization, and unnecessary ini- 
tialization of data. 

4 Modular structure leads to poor resource management. To
provide encapsulation and safety, memory management is typically
performed by library routines. However, resource management can
often be improved by giving the application program control so
that resources can be managed globally.

These performance problems are significant because they lead to a
phenomenon that we call Interface Bloat. The only way that libraries can
support a diverse set of clients is to provide a wide interface that
includes a large number of specialized routines. Such interfaces can often
be separated into two groups, a Core interface that provides all of the
basic functionality of the library, and an Advanced interface that provides
specialized routines that are applicable only in specific situations.

Interface Bloat leads to both short term and long term problems. The 
first short term problem is that large, complex interfaces are difficult to 
use. For example, MPI provides 12 ways to perform point-to-point com- 
munication [18]. These routines don’t differ in their functionality, but 
differ in their buffering of data, their completion semantics, etc. The 
second short term problem is that the routines in the Advanced
interface are typically more difficult to use, which increases the
complexity of application programs. For example, MPI’s Ready-Send
assumes that the sending and receiving processes are already
synchronized and that the receiver has prepared a sufficient buffer for
the receipt of the message. Thus, Ready-Send requires the careful
orchestration of the sending and receiving processes. Another example
comes from the GNU Multi-Precision Library [11]:

The mpn functions [the Advanced interface] are designed to be as fast 
as possible, not to provide a coherent calling interface. The different 
functions have somewhat similar interfaces, but there are variations that
make them hard to use. These functions do as little as possible apart
from the real multiple precision computation, so that no time is spent 
on things that not all callers need. 

More seriously, Interface Bloat leads to long term software engineering
problems with respect to both portability and maintenance: 

No performance portability. Ready-Send is typically the most 
efficient form of point-to-point communication on distributed memory 
machines, but on machines with hardware support for shared memory, 
MPI_Get() and MPI_Put() are faster. Thus, programmers must recode
their application to optimize the communication for different machines. 
This means, for example, that the invasive changes required to use 
Ready-Send can be counter-productive, as they complicate any subse- 
quent porting and tuning efforts. 

Premature Optimization Complicates Maintenance. The 
use of specialized routines represents a form of premature optimization, 
which is a common source of problems [16]. Because the optimizations 
are embedded in the source code, the program’s overall logic can be
obscured, making programs more difficult to read and maintain. For
example, to be profitable, an asynchronous receive requires that some
computation be moved above the wait() to hide the latency of the
message:

    send()           send()
    recv()      =>   irecv()
    compute();       compute1();
                     wait()
                     compute2();

This restructuring of the computation can make the program more dif- 
ficult to understand since it breaks a single logical unit of computation 
into two pieces. It also implicitly introduces new dependence relations 
among the different pieces of code that must now be maintained. In 
the above example, the code in compute1() cannot be dependent on the
data that is being sent. 
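
The restructuring can be mimicked in a self-contained sketch. It is written in Python with a worker thread standing in for the message layer, so it is an analogy rather than MPI code: `receive`, `compute1`, and `compute2` are hypothetical stand-ins, and the sleep models message latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def receive():
    """Stand-in for the arrival of a message with some latency."""
    time.sleep(0.05)
    return [1.0, 2.0, 3.0]


def compute1(local):
    """The part of the computation that does not need the message."""
    return sum(x * x for x in local)


def compute2(partial, message):
    """The part of the computation that needs the message."""
    return partial + sum(message)


def blocking_version(local):
    message = receive()                       # recv(): wait for the data first
    return compute2(compute1(local), message) # compute(): one logical unit


def nonblocking_version(local):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(receive)        # irecv(): start, return at once
        partial = compute1(local)             # compute1(): overlaps the latency
        message = pending.result()            # wait(): block until data is here
    return compute2(partial, message)         # compute2(): uses the message


# Usage: both versions produce the same result.
blocking_result = blocking_version([1.0, 2.0])
nonblocking_result = nonblocking_version([1.0, 2.0])
```

The second version is only correct as long as `compute1` never touches the message, which is exactly the hidden dependence constraint that later maintainers have to preserve.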

Interface Bloat Defeats Modularity. Bloated interfaces often 
expose implementation details to the client. This violation of Parnas’ 
modularity principle [19] leads to an overly strong coupling between 
modules. Whereas a buffered Send routine encapsulates all synchroniza- 
tion, Ready-Send scatters it throughout the program. Strong coupling 
defeats portability, as different hardware environments can prefer differ- 
ent versions of the point-to-point communication routines [6]. 

3. LIBRARY-LEVEL OPTIMIZATION 

This section explains how the use of libraries can be optimized without 
incurring the penalties described in the previous section. We present a 
detailed example using a parallel linear algebra library, and we use this
example to draw conclusions about library-level optimization and to 
characterize our compiler-based solution. 

3.1. PLAPACK EXAMPLE 

The PLAPACK library is a set of routines for coding parallel linear 
algebra algorithms in C or Fortran [21]. PLAPACK aims to provide high 
performance, and the library has been carefully designed by experts 
in the area of parallel linear algebra. PLAPACK consists of parallel 
versions of the same routines found in BLAS [8] and LAPACK [1]. At the 
highest level, it provides an interface that hides much of the parallelism 
from the programmer. 

PLAPACK provides abstractions that can be useful for performing 
optimizations. For example, PLAPACK programs manipulate linear algebra 
objects indirectly through handles called views. A view consists of 
data, possibly distributed across processors, and an index range that 
selects some or all of the data. A typical algorithm operates by partitioning 
the views and working on one piece at a time. While most PLAPACK 
procedures are designed to accept any type of view, the actual parameters 
often have special distributions. Recognizing and exploiting these 
special distributions can yield significant performance gains [2]. 

Figure 2 Cholesky factorization using PLAPACK. 
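
The idea of a view as shared data plus an index range, partitioned one piece at a time, can be illustrated with a small one-dimensional sketch. The View class, its split method, and process_in_pieces are hypothetical names invented for this illustration; they are not PLAPACK routines, and real views also carry distribution information that is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    data: list   # shared underlying storage; splitting never copies it
    start: int   # first selected index
    stop: int    # one past the last selected index

    def __len__(self):
        return self.stop - self.start

    def split(self, size):
        """Partition into a leading piece of at most `size` items and the rest."""
        cut = self.start + min(size, len(self))
        return View(self.data, self.start, cut), View(self.data, cut, self.stop)

    def items(self):
        return self.data[self.start:self.stop]


def process_in_pieces(view, size):
    """Peel one piece off the front of the view until nothing is left."""
    pieces = []
    while len(view) > 0:
        head, view = view.split(size)
        pieces.append(head.items())
    return pieces
```

For a seven-element view and a piece size of three, the loop visits pieces of length three, three, and one, all of which refer to the same underlying list.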

Figure 2 shows a Cholesky factorization program written with PLA- 
PACK, along with graphical depictions of the matrix at each step. The 
PLA_Obj_split_size routines ensure that the split occurs on a processor 
boundary. Thus, the smallest piece, A11 (the black view in step 3), 
resides entirely on a single processor, and A21 (the black view in step 
4) resides on a column of processors. We can exploit these two facts 
by replacing the general-purpose PLA_Trsm and PLA_Syrk routines with 
customized routines that run as much as three times faster [12]. 

3.2. LESSONS FROM OUR EXAMPLE 

A key concept in the above optimization is the replacement of general 
routines with specialized routines that can make stronger assumptions 
about their calling context, and thus can execute more efficiently. Such 
optimizations are possible because most bloated library interfaces pro- 
vide many specialized routines in their Advanced interface. In the case of 
PLAPACK, the interface is technically an “open infrastructure,” which 
allows library users to see the lower levels of the library. 

Another key to this optimization lies in analyzing the program to discover 
the special case matrix distributions. Human programmers who 
are facile with PLAPACK can perform such analysis manually. Conven- 
tional compilers, however, cannot perform such analysis because most 
programming languages have no notion of a matrix, let alone matrix 
distributions. Thus, to perform the types of optimizations described 
above, the compiler must be informed of the relevant domain-specific 
abstractions so that program analysis can be phrased in these terms. 

Our compiler-based solution thus uses an annotation language to de- 
scribe domain-specific information. The language provides a mechanism 
for identifying important library-specific concepts, such as the notion 
of a view in PLAPACK, and for enumerating important properties of 
those concepts, such as the fact that a view can reside on a single pro- 
cessor. For example, the following annotation identifies four important 
properties of views: 

property Distribution = { Local, Empty, Matrix, ColPanel, RowPanel }; 

The annotations can also describe how the various library routines ma- 
nipulate these properties and how such properties can be used to replace 
a general routine with a more specific and efficient one. For example, the 
PLA_Obj_vert_split_2() routine might have the following annotation: 

int PLA_Obj_vert_split_2(obj, length, left, right) 
{ 
    ... // other annotations omitted 

    property Distribution { 
        (View1.Distribution == Matrix) => left = Local, right = Matrix; 
    } 

    specializations { 
        (View1.Distribution == Empty) => NOOP; 
    } 
} 

The property construct indicates that this routine creates two views, 
left and right, with the specified properties; the specializations 
construct indicates that if View1 (which is associated with obj through 
an annotation that is elided from this figure) is Empty, then an invocation 
of PLA_Obj_vert_split_2() can be removed since it is a no-op. Our 
annotation language also provides other features that facilitate program 
analysis. Details of our language can be found elsewhere [12, 13]. 
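
To make the two clauses concrete, the following toy sketch interprets them over a straight-line list of calls. It is an illustration under stated assumptions, not the Broadway compiler or its annotation syntax: split_property, split_specialization, analyze, and the tuple encoding of calls are all hypothetical, and only the two rules shown in the annotation above are modeled.

```python
def split_property(view1):
    """property clause: distributions of (left, right) given the input view."""
    if view1 == "Matrix":
        return "Local", "Matrix"
    return None  # the annotation shown says nothing about other inputs


def split_specialization(view1):
    """specializations clause: what the call may be replaced with, if anything."""
    return "NOOP" if view1 == "Empty" else None


def analyze(calls, env):
    """Propagate Distribution facts and drop calls that specialize to NOOP.

    calls: list of (routine, obj, left, right) tuples in program order.
    env:   dict mapping a view name to its known Distribution value.
    """
    kept = []
    for routine, obj, left, right in calls:
        dist = env.get(obj, "Unknown")
        if split_specialization(dist) == "NOOP":
            continue                      # the call is removed entirely
        result = split_property(dist)
        if result is not None:
            env[left], env[right] = result
        kept.append((routine, obj, left, right))
    return kept, env
```

Given one split of a Matrix view and one split of an Empty view, the sketch records that the first call yields a Local view and a Matrix view, and deletes the second call.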

While the optimizations described in this section can be performed 
manually, two points are significant. First, such optimizations are tedious 
and require intimate knowledge of the PLAPACK library. Second, 
manual optimization is limited by the library’s interface, but compiler-based 
optimization is not. In particular, the Broadway compiler can 
specialize library routines in ways that the library designer did not foresee, 
producing inlined or cloned versions that are optimized for their 
specific calling context. 

Figure 3 Black Boxes (left) and Black Boxes with Meta Interfaces (right). 

4. BROADWAY AS A META INTERFACE 

Section 2 enumerated four weaknesses of software libraries. The 
first of these has previously been identified as a limitation of black 
boxes [14, 15, 17]. In particular, the use of black boxes leads to per- 
formance problems because the implementation and interface that black 
boxes provide will inevitably be inappropriate for some client. One solu- 
tion to this problem is to provide two interfaces, a base interface, which 
most clients use, and a separate meta interface, which allows the black 
box to adapt to the needs of different clients [14]. Figure 3 shows a Black 
Box and a Black Box that has been augmented with a meta interface. 

The meta interface provides a controlled method of exposing the in- 
nards of a black box. The separation of the two interfaces is significant 
because each has different goals and each is aimed at a different user. 
The meta interface is aimed at sophisticated users and is typically ac- 
cessed much less frequently than the base interface. Meanwhile, the base 
interface is aimed at the typical user who does not want to modify the 
black box. The separation of the two interfaces allows the base interface 
to retain the simplicity of an idealized black box interface. 

The remainder of this section evaluates libraries and the Broadway 
compiler with respect to meta interfaces. We identify the different types 
of users in each system, the interfaces that are presented to these users, 
and the type of expertise that is expected of these users. 

Traditional Libraries. Traditional libraries (Figure 4) have no 
meta interface. In such systems, there are only two users: the appli- 
cations programmer who uses the library, and the library creator. The 
only way to provide customized implementations is for the library creator 
to expand the base interface, which forces the library user to deal 
with all of the problems of interface bloat. Bloated interfaces are poor 
substitutes for meta interfaces because they do not provide any mecha- 
nism for changing the implementation. This means that all specialized 
routines must be anticipated in advance by the library creator, rather 
than created in response to specific client needs. 

The shaded boxes in Figure 4 represent the amount of expertise that 
is required to implement the various components. For example, with tra- 
ditional libraries we see that the library writer must have considerable 
expertise in the library domain and must have some understanding of 
performance and application needs to implement algorithms efficiently. 
Significantly, we see that the C/Fortran compiler is given no knowl- 
edge of the library domain, so any library-level optimizations must be 
performed by the applications programmer. Thus, considerable burden 
is placed upon the applications programmer, who must not only under- 
stand the application domain, but must also possess considerable library, 
performance, and compiler expertise to achieve good performance. 

Broadway. The Broadway Architecture provides a meta interface 
to software libraries: The annotation language provides a way to change 
the library’s implementation so that it is more suitable for a specific 
client. In this approach, there is, in addition to the library writer and 
user, a library expert who creates the annotations. This person may 
or may not be the same as the library creator. While the Broadway 
architecture shown in Figure 4 is more complex than the traditional 
library architecture, the added complexity is completely hidden from 
the applications programmer and the library writer. For example, the 
figure shows how the Advanced interface can be considered a part of the 
meta interface, rather than exposed to the applications programmer. 

The Broadway meta interface is a language for describing domain- 
specific analysis and domain-specific transformations. For example, the 
language can easily configure an analysis that determines the data distri- 
bution of matrices in a PLAPACK program, as described in Section 3.1. 
The annotations can also concisely specify code transformations that are 
triggered by the results of this analysis [12, 13]. 

5. RESULTS AND DISCUSSION 

This section evaluates our solution. We provide experimental evidence 
that our solution is effective, and we explain how our system’s meta 
interface provides many benefits over traditional libraries. 

Figure 5 [12] shows the result of applying our techniques to the 
PLA_Trsm routine of the Cholesky factorization program described in 
Section 3. The baseline measures the performance of the high qual- 
ity but general purpose PLA_Trsm routine. The hand-optimized routine 
was optimized by members of the PLAPACK development team to ex- 
ploit the specific distribution of matrices found in the Cholesky factor- 
ization program. Finally, the Broadway-optimized version represents a 
compiler-based approach that uses the same principles. The gap be- 
tween the hand-optimized and Broadway-optimized approaches shows 
an important benefit of automated approaches — they can apply tedious 
transformations uniformly and completely. 

5.1. BENEFITS OF THE BROADWAY 
ARCHITECTURE 

Provides a mechanism for improving performance. The 
Broadway meta interface improves performance by addressing all four 
weaknesses of traditional software libraries (Section 2). First, our solu- 
tion can create different library implementations and interfaces for differ- 
ent clients. Second, our solution conveys library-specific information to 
the compiler and uses this information to customize the library for differ- 
ent users. Thus, information flows across the meta interface through the 
Broadway compiler. Third, our solution replaces invocations of general 
routines with invocations to specialized routines, thereby relaxing worst 
case assumptions. These specialized routines might already exist in the 
library’s Advanced interface, or these specialized routines might be created 
by the Broadway compiler. Finally, by integrating library and client 
code, our compiler can schedule operations globally, removing redundant 
operations across procedure call boundaries. While conventional compilers 
can perform interprocedural analysis to remove redundant primitive 
operations, our compiler can remove redundant domain-specific operations, 
which typically leads to much greater runtime savings. 

Figure 5 Performance comparison of baseline, hand-customized and Broadway-customized 
PLA_Trsm() function for the Cholesky program (PLA_Trsm kernel, Cray T3E, 36 processors). 

Improves the maintenance and portability of applications. 
The Broadway architecture provides long term benefits in terms of maintenance 
and portability. The existence of the meta interface allows the 
Broadway compiler to perform library level optimizations, reducing the 
application programmer’s temptation to perform premature optimiza- 
tions. By avoiding the Advanced interface, the programmer improves 
maintenance and portability. For existing libraries, our solution allows 
the Core and Advanced interfaces to be separated, with the Advanced 
interface being considered a part of the meta interface. This separation 
gives the programmer a simpler view of the library. For future libraries, 
our solution allows library designers to create simpler library interfaces. 
Thus, as shown in Figure 4, the applications programmer’s task is con- 
siderably reduced, so the predominant expertise required of the library 
user is application expertise. 

Enhances the value of legacy codes. The annotations are stored 
separately from the application source code and are not visible to the 
applications programmer, so our solution applies to existing libraries and 
existing applications without modification to the vast base of existing 
source code. Thus, by separating the annotation language from the base 
interface, the Broadway architecture enhances the value of legacy codes. 

Amortizes costs. From the compiler writer’s point of view, the 
Broadway compiler is ideally written once, and this cost is amortized 
across many different libraries. From the library annotator’s point of 
view, the meta interface is ideally used once to create a set of anno- 
tations, and this cost is amortized over the lifetime of the library and 
across many applications. By contrast, the effort to perform manual 
library level optimization improves the performance of only a single ap- 
plication. 

Provides clean division of labor. Finally, our architecture sepa- 
rates the roles of the compiler writer, the library writer, and the appli- 
cation writer so that each task is simplified. All of the domain-specific 
expertise is localized in the annotations, which are supplied once by a 
library expert. The annotation language has been designed to mini- 
mize the amount of compiler expertise required to use it. Thus, all of 
the static analysis and optimization strategies are encapsulated in our 
Broadway compiler, as specific analyses and optimizations are implicitly 
configured by the information supplied by the annotations. Together, 
the annotation language and Broadway compiler free the application 
programmer to focus on designing clean applications and to resist the 
temptation to prematurely optimize their source code. 

6. RELATED WORK 

There has been considerable work in optimizing and customizing soft- 
ware libraries. The related work can be grouped into two categories. The 
first maintains the traditional library structure as shown in Figure 4, 
while the second uses a meta interface approach that is similar to ours. 
Among the meta interface systems, our approach has the advantage of 
preserving the existing base interface exactly. 

Smart Libraries. A number of libraries have been built that 
attempt to select efficient implementations based on the specific values 
of input parameters [3, 5, 20]. These libraries provide a restricted degree 
of customization that is limited to a pre-defined set of implementations. 

Automatically Generated Libraries. ATLAS [23], PHiPAC [4], 
and FFTW [10] have shown that efficient machine-specific libraries can 
be automatically generated. As with the “smart libraries,” these auto- 
matically generated libraries preserve the traditional library structure. 
These approaches address the issue of portability but do not provide a 
mechanism for customizing libraries for specific clients. 

Magik. Engler’s Magik system [9] has a structure that is very 
similar to ours (see Figure 6). Magik gives the programmer access to 
a C compiler’s internal representation and symbol table. Thus, Magik 
can be used to perform certain compiler transformations, as well as to 
extend the C language in limited ways. Magik differs significantly from 
Broadway in two ways. First, Magik theoretically provides more pow- 
erful transformational capabilities since it exposes all of the compiler’s 
internals to the meta programmer. However, this power comes at a 
cost: the meta programmer must possess both compiler expertise and 
library domain expertise. Second, Magik does not provide the ability 
to define new domain-specific analyses, which are central to library-level 
optimizations. 

Meta-Object Protocols. The notion of meta interfaces was pi- 
oneered in the domain of object oriented languages and Meta-Object 
Protocols (MOPs) [7]. Like Magik, these systems provide a mechanism 
to change the way a language is compiled, which provides both opti- 
mization and extension capabilities. In comparison to Broadway, MOPs 
provide more limited support for analysis and transformations. Most 
MOPs also provide ways to change the syntax of the base language. 

Formal Semantics. Vandevoorde [22] defines a system whose 
structure is almost identical to Broadway’s, but whose approach is fundamentally 
different. Vandevoorde’s optimizations are based on formal 
semantics and theorem proving, so the transformations require complete 
formal semantics of a procedure’s behavior, and they depend on theorem 
proving, which can only be partially automated. 

7. CONCLUSION 

In this paper we have explained how the lack of a meta interface en- 
courages library designers to produce bloated interfaces. These bloated 
interfaces in turn create long term portability and maintenance prob- 
lems. We have shown how the Broadway solution provides a meta inter- 
face that yields a desirable division of labor — among the library writer, 
the compiler writer, and the applications programmer — that is essen- 
tial in the domain of scientific computing in which high performance 
is critical and both libraries and applications require a large degree of 
domain expertise. Finally, we have argued that Broadway’s meta inter- 
face enhances the use of software libraries and improves the quality of 
application code. 

Notes 

1. Many variations of this system are possible. For example, the library source might be 
encoded to prevent general access to the source, and the output code does not necessarily 
need to be produced as a single unified piece of code. 

DISCUSSION 

Speaker: Calvin Lin 

Masaaki Shimasaki : Could you describe the data distribution 
method in more detail in relation to annotation? Is it possible to describe 
how to distribute data to processors in manners such as block, 
cyclic, or block cyclic? Who should decide how to distribute data to the 
processors, the compiler or user? 

Calvin Lin : Our annotations do not provide a way to specify data 
distributions. In our PLAPACK example, our annotations enabled our 
compiler to reason about the way the data distributions of various matrices 
change during the execution of the application program, but the 
actual distributions were defined by the PLAPACK library and the ap- 
plications programmer. 

David W. Walker : In the example you gave comparing the perfor- 
mance of a PLAPACK routine, Broadway was at least a factor of two 
faster than the original code. What optimization did Broadway perform 
to achieve this impressive level of improvement? 

Calvin Lin : Conceptually, the performance benefit comes from spe- 
cializing a general purpose library routine for the specific context of the 
given application. This specialization can enable other optimizations, 
such as dead code elimination. In this case the domain-specific anal- 
ysis allowed a while loop to be converted to straight line code, which 
eliminated some expensive dead code. 

Richard Fateman : To what extent were you influenced by Common 
Lisp and its Meta Object Protocol? Some of the difficulties faced by 
Broadway could be alleviated by not being based on C! 

Calvin Lin : The key difference between a Meta Object Protocol 
(MOP) and our meta interface is the separation of concerns that our solution 
provides. Our solution separates the domain information from the 
compiler expertise. By contrast, most domain experts would find MOPs 
difficult to use because MOPs require the programmer to manipulate 
an internal representation of the program. 

Scott Kohn : Is your annotation language expressive enough to address 
array transformations in C++, such as Quinlan’s work on Rose, or the 
Standard Template Library (STL), such as done by Kai? 

Calvin Lin : This is an excellent question. In general, there’s the 
question of where we draw the line between what is hard coded in the 
compiler and what is configurable through the annotations. There are 
certainly some array optimizations that are difficult to express through 
annotations and others that are not. We’ve so far taken a very conserva- 
tive approach that attempts to keep the annotation language as simple 
as possible, but I think that our current work in expressing more complex 
pattern matching will expand the type of array optimizations that 
are possible. 

John Rice : Can the Broadway annotation technique be extended to 
exploit run-time information? In the application of the previous speaker, 
a program runs for several days, but all the code is exercised in the first 
10 minutes. The cost of recompiling a 50 or 200 thousand line code is 
easily recovered if only a small improvement (say 5% or 10%) in efficiency 
is achieved. 

Calvin Lin : Yes, I believe that our approach is ideally suited to dy- 
namic compilation precisely because our annotations support high-level, 
domain-specific optimizations that can have large performance benefits. 
For example, I think that annotations could be used to guide the com- 
piler in inserting instrumentation code. 

Ronald Boisvert : PLAPACK may have been a particularly easy example 
for your system. It’s well structured and the domain provides 
many opportunities for such optimization. Not all code may be equally 
“annotatable.” Are there general principles that library designers can 
use to make their code more amenable to your technology? 

Calvin Lin : Yes. Our approach works best with well-structured, mod- 
ular code. In addition, libraries are easier to annotate if they have 
well-defined (perhaps domain-specific) abstractions. 




FORMAL METHODS FOR 
HIGH-PERFORMANCE LINEAR 
ALGEBRA LIBRARIES* 



John A. Gunnels, Robert A. van de Geijn 
The University of Texas at Austin 
Austin, TX, USA 



Abstract A colleague of ours, Dr. Timothy Mattson of Intel, once made the fol- 
lowing observation: “Literature professors read literature. Computer 
Science professors should at least occasionally read code.” The point he 
was making was that in order to write superior prose one needs to read 
good (and bad) literature. Analogously, it is our thesis that exposure 
to elegant (and ugly) programs tends to yield the insights which are 
necessary if one wishes to produce consistently well-written code. 

Since the advent of high-performance distributed-memory parallel 
computing, the need for intelligible code has become ever greater. De- 
velopment and maintenance of libraries for these kinds of architectures 
is simply too complex to be amenable to conventional approaches to 
coding. Attempting to do so has led to the production of an abun- 
dance of inefficient, anfractuous code that is difficult to maintain and 
nigh-impossible to upgrade. 

Having struggled with these issues for more than a decade, we have 
arrived at a conclusion which is somewhat surprising to us: the answer 
is to apply formal methods from Computer Science to the development 
of high-performance linear algebra libraries. The resulting approach 
has consistently resulted in aesthetically-pleasing, coherent code that 
greatly facilitates performance analysis, intelligent modularity, and the 
enforcement of program correctness via assertions. Since the technique 
is completely language-independent, it lends itself equally well to a wide 
spectrum of programming languages (and paradigms) ranging from C 
and Fortran to C++ and Java. In this paper, we illustrate our observations 
by looking at our Formal Linear Algebra Methods Environment 
(FLAME). 



*This work was partially supported by the Remote Exploration and Experimentation Project 
at Caltech’s Jet Propulsion Laboratory, which is part of NASA’s High Performance Comput- 
ing and Communications Program, and is funded by NASA’s Office of Space Science. 

1. INTRODUCTION 

The core curriculum of any first-rate undergraduate Computer Sci- 
ence department includes at least one course that focuses on the formal 
derivation and verification of algorithms [8]. Many of us in scientific 
computing may have, at some point in time, hastily dismissed this approach, 
arguing that this is all very nice for small, simple algorithms, but 
an academic exercise hardly applicable in “our world.” Since it is often 
the case that our work involves libraries comprised of hundreds of thou- 
sands or even millions of lines of code, the knee-jerk reaction that this 
approach is much too cumbersome to take seriously is understandable. 
Furthermore, the momentum of established practices and “traditional 
wisdom” do little if anything to dissuade one from this line of reason- 
ing. Yet, as the result of our search for superior methods for designing 
and constructing high-performance parallel linear algebra libraries, we 
have come to the conclusion that it is only through the systematic ap- 
proach offered by formal methods that we will be able to deliver reliable, 
maintainable, flexible, yet highly efficient matrix libraries even in the 
relatively well-understood area of (sequential and parallel) dense linear 
algebra. 

While some would immediately draw the conclusion that a change to 
a more modern programming language like C++ is at least highly desirable, 
if not a necessary precursor to writing elegant code, the fact is 
that most applications which call packages like LAPACK [2] and ScaLAPACK [3] 
are still written in Fortran and/or C. Interfacing such an 
application with a library written in C++ presents certain complications. 
However, during the mid-nineties, the Message-Passing Interface 
(MPI) introduced to the scientific computing community a programming 
model, object-based programming, that possesses many of the advan- 
tages typically associated with the intelligent use of an object-oriented 
language [18]. Using objects (e.g. communicators in MPI) to encapsu- 
late data structures and hide complexity, a much cleaner approach to 
coding can be achieved. Our own work on the Parallel Linear Algebra 
PACKage (PLAPACK) borrowed from this approach in order to hide de- 
tails of data distribution and data mapping in the realm of parallel linear 
algebra libraries [20]. The primary concept also germane to this paper is 
that PLAPACK raises the level of abstraction at which one programs so 
that indexing is essentially removed from the code, allowing the routine 
to reflect the algorithm as it is naturally presented in a classroom setting. 
Since our initial work on PLAPACK, we have experimented with similar 
interfaces in such seemingly disparate contexts as (parallel) out-of-core 
linear algebra packages and a low-level implementation of the sequential 
Basic Linear Algebra Subprograms (BLAS) [15, 6, 5, 16, 9]. 

Our Formal Linear Algebra Methods Environment (FLAME) is the 
latest step in the evolution of these systems [11]. It facilitates the use 
of a programming style which is equally applicable to everything from 
out-of-core, parallel systems to single-processor systems where cache- 
management is of paramount concern. 

It has become apparent that what makes our task of library devel- 
opment more manageable is this systematic approach to deriving algo- 
rithms coupled with the abstractions we use to make our code reflect 
the algorithms thus produced. Further, it is from these experiences that 
we can confidently state that this approach to programming greatly re- 
duces the complexity of the resultant code and does not sacrifice high 
performance in order to do so. 

Indeed, the formal techniques which we may have dismissed as merely 
academic or impractical make this possible, as we attempt to illustrate 
in the following sections. 

2. THE CASE FOR A MORE FORMAL 
APPROACH 

Ideally, an implementation should clearly reflect the algorithm as it is 
presented in a classroom setting. Even better, some of the derivation of 
the algorithm should be apparent in the code and different variants of an 
algorithm should be recognizable as slight perturbations to an algorith- 
mic “skeleton” or base code. If the code is just a mechanically-realizable, 
straightforward translation of the algorithm presented in class, there 
should be no opportunity for the introduction of logical errors or coding 
bugs. (Note: while we will frequently refer to translations from 
algorithms to code as being mechanical or automatic, this process is 
currently performed by us by hand.) Presumably, it should be possible 
to prove the algorithms correct, thus ensuring that the code is correct. 

Typically, it is difficult to prove code correct precisely because one 
is not certain that the code truly mirrors the algorithm. With our ap- 
proach, the chasm is largely bridged by the simple yet crucial fact that 
some very simple syntactic rewrite rules can produce the code from an 
algorithm expressed as one might in a classroom, using mathematical 
formulae and stylized matrix depictions. Since we can prove the correct- 
ness of the algorithm we wish to employ and because the correctness of 
the translation from algorithm to code is at least as reliable as compiler 
technology, the complexity of the task at hand is greatly ameliorated. 
By assuming that components adhere to explicit, “contractual obligations” 
[1] the algorithmic proof requires little alteration in order to be 
applicable to the code. In the case of a library constructed entirely 
through the methodology presented here, these components would be 
composed in like manner so as to make this task manageable. This 
is largely due to the fact that the approach presented here leads to a 
software architecture layered in such a way so as to require one to rely 
on the correctness of a very small number of base-level modules. Since 
those units are small, their correctness can be established through the 
application of standard formal methods. It is true that, in practice, 
one must accept that an application will need to interface with other 
libraries (for example, the vendor-supplied BLAS) that are not built in 
a “proof-friendly” format. However, it may still be possible to establish 
the correctness of a program if one is careful to impose minimal obliga- 
tions on these, presumably time-tested and well-documented, pieces of 
code. 

It should be noted that the “correctness” discussed so far does not 
address issues of numerical stability. We make no claim regarding the 
stability of the resulting algorithm. 

Having said this, let us clarify through a simple example. 



3. A CASE STUDY: LU FACTORIZATION 

We illustrate our approach by considering LU factorization without 
pivoting. Given an n x n matrix A we wish to compute an n x n lower 
triangular matrix L with unit main diagonal and an n x n upper trian- 
gular matrix U so that A = LU. The original matrix A is overwritten 
by L and U in the process. 
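
As a point of reference for the derivations that follow, here is a minimal sketch of one conventional way to perform this factorization in place. It is a plain unblocked loop written for illustration, not FLAME code and not one of the derived variants; the name lu_inplace and the list-of-lists matrix representation are assumptions made for this example.

```python
def lu_inplace(A):
    """Overwrite square matrix A (list of row lists) with its packed LU factors.

    After the call, the strictly lower triangle holds L (its unit diagonal is
    implicit) and the upper triangle, including the diagonal, holds U.
    No pivoting is performed, so a zero pivot is reported as an error.
    """
    n = len(A)
    for k in range(n):
        pivot = A[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot: factorization without pivoting fails")
        for i in range(k + 1, n):
            A[i][k] /= pivot                   # entry of column k of L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # update the trailing submatrix
    return A
```

For the 2 x 2 matrix with rows (4, 3) and (6, 3), the call leaves rows (4, 3) and (1.5, -1.5) in place: the multiplier 1.5 belongs to L, and the remaining entries form U.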



3.1. POSSIBLE LOOP-INVARIANTS 



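In order to prove correctness, we must ask what intermediate value 
is in A at any particular stage of the algorithm. To answer this, partition 

\[
A = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad
L = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}, \quad
U = \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix},
\]

where $A_{TL}$, $L_{TL}$, and $U_{TL}$ are all $k \times k$ matrices. Notice that “T”, “B”, 
“L”, and “R” stand for Top, Bottom, Left, and Right, respectively. 
Notice that 

\[
A = LU = \begin{pmatrix} L_{TL} U_{TL} & L_{TL} U_{TR} \\ L_{BL} U_{TL} & L_{BL} U_{TR} + L_{BR} U_{BR} \end{pmatrix},
\]

so the following equalities must hold: 

\[
A_{TL} = L_{TL} U_{TL}, \qquad A_{TR} = L_{TL} U_{TR}, \qquad
A_{BL} = L_{BL} U_{TL}, \qquad A_{BR} = L_{BL} U_{TR} + L_{BR} U_{BR}.
\]

Finally, let $A_k$ denote the matrix which holds the current intermediate 
result of a given algorithm for computing the LU factorization. In the 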
following pages we will show that different conditions on the contents of 
Ak logically dictate different variants for computing the LU factoriza- 
tion, and that these different conditions can be systematically generated. 
Previewing this, notice that to compute the LU factorization, the sub- 
matrices of L and U are to be computed. We assume that Ak will contain 
partial results towards that goal. Here are some possibilities: 
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Here we use the notation L\U to denote a lower and upper triangular 
matrix which are stored in a square matrix by overwriting the lower and 
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upper triangular parts of that matrix. (Recall that L has ones on the 
diagonal, which need not to be stored.) 

In the subsequent subsections, we describe how to derive algorithms 
in which the desired conditions hold. Note that in this paper we will 
not concern ourselves with the question of whether or not the above 
conditions exhaust all possibilities. However, they do give rise to many 
commonly discussed algorithms. For example, they yield all algorithms 
depicted on the cover of, and discussed in, G.W. Stewart’s recent book 
on matrix factorization [19]. 

This comes as no surprise as we, like Stewart, have adopted some 
common implicit assumptions about both matrix partitioning and the 
nature of algorithmic advancement. In this paper we have restricted 
ourselves to a consideration of only those algorithms whose progress 
is “simple.” That is, each iteration of the algorithm is geographically 
monotonic and formulaically identical. The combination of these two 
properties leads to algorithms whose (inductive) proofs of correctness 
are straightforward and whose implementations, given our framework, 
are virtually fool-proof. 

While it is not a central focus of this paper, we also feel that it 
is worthwhile to point out our dissatisfaction with the categorization 
schemes which have traditionally been used to label algorithms of this 
class. In the literature one finds, for example, left- vs. right-looking 
algorithms [7]. The problem is that a left-looking algorithm for one 
operation has a very different flavor from a left-looking algorithm for 
another operation. Stewart, in [19], supplies novel naming conventions 
for some of these algorithms. We feel that the naming conventions can 
be made intuitively appealing and systematic, based on the work done 
in the inductive step of the algorithm. We intend to explain our classifi- 
cation further in a subsequent paper, after we have had an opportunity 
to evaluate it against a larger class of algorithms. 

3.2. LAZY ALGORITHM 

This algorithm is often referred to as a bordered algorithm in the 
literature. Stewart [19], rather colorfully, refers to it as Sherman’s march. 

We only discuss blocked variants of the algorithms due to space limi- 
tations. In [11] we give details for unblocked algorithms. 

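Algorithm: Assume that only (2) has been satisfied. The question 
is now how to compute $A_{k+b}$ from $A_k$ for some small block size b (i.e. 
$1 \le b \ll n$). To answer this, repartition 

$$
A = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}
= \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix} \qquad (6)
$$

where $A_{00}$ is k x k (and thus equal to $A_{TL}$), and $A_{11}$ is b x b. Repartition 
L, U, and $A_k$ conformally. Notice our assumption is that $A_k$ holds 

$$
A_k = \begin{pmatrix} L\backslash U_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix},
$$

while $A_{k+b}$ must hold 

$$
A_{k+b} = \begin{pmatrix} L\backslash U_{00} & U_{01} & A_{02} \\ L_{10} & L\backslash U_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix},
$$

so it suffices to compute $U_{01}$, $L_{10}$, $L_{11}$, and $U_{11}$: 

1 To compute $U_{01}$ we solve the triangular system $L_{00} U_{01} = A^{(k)}_{01}$. 
The result can overwrite $A^{(k)}_{01}$. 

2 To compute $L_{10}$ we solve the triangular system $L_{10} U_{00} = A^{(k)}_{10}$. 
The result can overwrite $A^{(k)}_{10}$. 

3 To compute $L_{11}$ and $U_{11}$ we update $A^{(k)}_{11} \leftarrow A_{11} - L_{10} U_{01} = L_{11} U_{11}$, 
after which the result can be factored into $L_{11}$ and $U_{11}$ using the 
unblocked algorithm. The result can overwrite $A^{(k)}_{11}$. 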

The preceding discussion motivates the algorithm in Fig. 1(b) for over- 
writing the given n x n matrix A with its LU factorization. 
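
To make the three steps of the lazy variant concrete, the following sketch spells them out on a plain column-major array. It is an illustration under stated assumptions rather than FLAME code: the names `lu_nopivot_lazy`, `lu_block`, and `ELEM` are hypothetical, the triangular solves are written as explicit loops instead of BLAS calls, and no check for zero pivots is made.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access (storage convention assumed for this sketch). */
#define ELEM(a, lda, i, j) ((a)[(size_t)(j) * (size_t)(lda) + (size_t)(i)])

/* Unblocked LU without pivoting of the b x b block that starts at a. */
static void lu_block(double *a, int b, int lda)
{
    for (int k = 0; k < b; ++k) {
        for (int i = k + 1; i < b; ++i)
            ELEM(a, lda, i, k) /= ELEM(a, lda, k, k);
        for (int j = k + 1; j < b; ++j)
            for (int i = k + 1; i < b; ++i)
                ELEM(a, lda, i, j) -= ELEM(a, lda, i, k) * ELEM(a, lda, k, j);
    }
}

/* Lazy (bordered) blocked LU without pivoting, following Fig. 1(b).
 * a is n x n; nb is the algorithmic block size. */
void lu_nopivot_lazy(double *a, int n, int lda, int nb)
{
    assert(a != NULL && n >= 0 && lda >= n && nb >= 1);
    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? n - k : nb;   /* size of A11 */
        int e = k + b;

        /* A01 <- U01 = inv(L00) * A01 (forward substitution, unit diagonal) */
        for (int j = k; j < e; ++j)
            for (int i = 0; i < k; ++i)
                for (int p = 0; p < i; ++p)
                    ELEM(a, lda, i, j) -= ELEM(a, lda, i, p) * ELEM(a, lda, p, j);

        /* A10 <- L10 = A10 * inv(U00) (solve X * U00 = A10 column by column) */
        for (int j = 0; j < k; ++j)
            for (int i = k; i < e; ++i) {
                for (int p = 0; p < j; ++p)
                    ELEM(a, lda, i, j) -= ELEM(a, lda, i, p) * ELEM(a, lda, p, j);
                ELEM(a, lda, i, j) /= ELEM(a, lda, j, j);
            }

        /* A11 <- A11 - L10 * U01 */
        for (int j = k; j < e; ++j)
            for (int p = 0; p < k; ++p)
                for (int i = k; i < e; ++i)
                    ELEM(a, lda, i, j) -= ELEM(a, lda, i, p) * ELEM(a, lda, p, j);

        /* A11 <- L\U_11 = LU(A11) using the unblocked algorithm */
        lu_block(&ELEM(a, lda, k, k), b, lda);
    }
}
```

Note that the blocks to the right of and below $A_{11}$ are never touched in an iteration, which is what makes this variant “lazy”.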



3.3. EAGER ALGORITHM 

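This algorithm is often referred to as classical Gaussian elimination. 
Let us assume that (2), (3), and (4) have been satisfied, and as much 
of (5) as possible without completing any more of the factorization 
$L_{BR} U_{BR}$. Repartition A, L, U, and $A_k$ conformally as in (6). Notice 
our assumption is that $A_k$ holds 

$$
A_k = \begin{pmatrix}
L\backslash U_{00} & U_{01} & U_{02} \\
L_{10} & A_{11} - L_{10} U_{01} & A_{12} - L_{10} U_{02} \\
L_{20} & A_{21} - L_{20} U_{01} & A_{22} - L_{20} U_{02}
\end{pmatrix}.
$$

1 To compute $L_{11}$ and $U_{11}$ we factor $A^{(k)}_{11}$, which already contains 
$A_{11} - L_{10} U_{01}$, overwriting the original $A^{(k)}_{11}$. 

2 To compute $U_{12}$ we update $A^{(k)}_{12}$, which already contains $A_{12} - 
L_{10} U_{02}$, by solving $L_{11} U_{12} = A^{(k)}_{12}$, overwriting the original $A^{(k)}_{12}$. 

3 To compute $L_{21}$ we update $A^{(k)}_{21}$, which already contains $A_{21} - 
L_{20} U_{01}$, by solving $L_{21} U_{11} = A^{(k)}_{21}$, overwriting the original $A^{(k)}_{21}$. 

4 We then update $A^{(k)}_{22}$, which already contains $A_{22} - L_{20} U_{02}$, with 
$A^{(k)}_{22} - L_{21} U_{12}$, overwriting the original $A^{(k)}_{22}$. 

The resulting algorithm is given in Fig. 1(a). 

Partition $A = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}$ where $A_{TL}$ is 0 x 0 
do until $A_{BR}$ is 0 x 0 

    Repartition 
    $\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}$ 
    where $A_{11}$ is b x b 

    (a) Eager: 
        $A_{11} \leftarrow L\backslash U_{11} = LU(A_{11})$ 
        $A_{12} \leftarrow U_{12} = L_{11}^{-1} A_{12}$ 
        $A_{21} \leftarrow L_{21} = A_{21} U_{11}^{-1}$ 
        $A_{22} \leftarrow A_{22} - L_{21} U_{12}$ 

    (b) Lazy: 
        View $A_{00}$ as $L\backslash U_{00}$ 
        $A_{01} \leftarrow U_{01} = L_{00}^{-1} A_{01}$ 
        $A_{10} \leftarrow L_{10} = A_{10} U_{00}^{-1}$ 
        $A_{11} \leftarrow L\backslash U_{11} = LU(A_{11} - L_{10} U_{01})$ 

    (c) Row-lazy: 
        View $A_{00}$ as $L\backslash U_{00}$ 
        $A_{10} \leftarrow L_{10} = A_{10} U_{00}^{-1}$ 
        $A_{11} \leftarrow L\backslash U_{11} = LU(A_{11} - L_{10} U_{01})$ 
        $A_{12} \leftarrow U_{12} = L_{11}^{-1} (A_{12} - L_{10} U_{02})$ 

    (d) Column-lazy: 
        View $A_{00}$ as $L\backslash U_{00}$ 
        $A_{01} \leftarrow U_{01} = L_{00}^{-1} A_{01}$ 
        $A_{11} \leftarrow L\backslash U_{11} = LU(A_{11} - L_{10} U_{01})$ 
        $A_{21} \leftarrow L_{21} = (A_{21} - L_{20} U_{01}) U_{11}^{-1}$ 

    (e) Row-column-lazy: 
        $A_{11} \leftarrow L\backslash U_{11} = LU(A_{11} - L_{10} U_{01})$ 
        $A_{12} \leftarrow U_{12} = L_{11}^{-1} (A_{12} - L_{10} U_{02})$ 
        $A_{21} \leftarrow L_{21} = (A_{21} - L_{20} U_{01}) U_{11}^{-1}$ 

    Continue with 
    $\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}$ 
enddo 

Figure 1 Blocked versions of LU factorization without pivoting for five commonly 
encountered variants. The different variants share the skeleton which partitions and 
repartitions the matrix. Executing the operations in one of the five boxes yields a 
specific algorithm. 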
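
For comparison with the lazy variant, the eager updates of Fig. 1(a) can be sketched in the same style. Again this is a hedged illustration and not FLAME code: `lu_nopivot_eager`, `lu_block`, and `ELEM` are hypothetical names, column-major storage is assumed, and zero pivots are not detected.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access (storage convention assumed for this sketch). */
#define ELEM(a, lda, i, j) ((a)[(size_t)(j) * (size_t)(lda) + (size_t)(i)])

/* Unblocked LU without pivoting of the b x b block that starts at a. */
static void lu_block(double *a, int b, int lda)
{
    for (int k = 0; k < b; ++k) {
        for (int i = k + 1; i < b; ++i)
            ELEM(a, lda, i, k) /= ELEM(a, lda, k, k);
        for (int j = k + 1; j < b; ++j)
            for (int i = k + 1; i < b; ++i)
                ELEM(a, lda, i, j) -= ELEM(a, lda, i, k) * ELEM(a, lda, k, j);
    }
}

/* Eager blocked LU without pivoting, following Fig. 1(a). */
void lu_nopivot_eager(double *a, int n, int lda, int nb)
{
    assert(a != NULL && n >= 0 && lda >= n && nb >= 1);
    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? n - k : nb;   /* size of A11 */
        int e = k + b;

        /* A11 <- L\U_11 = LU(A11); A11 already holds A11 - L10*U01 */
        lu_block(&ELEM(a, lda, k, k), b, lda);

        /* A12 <- U12 = inv(L11) * A12 (forward substitution, unit diagonal) */
        for (int j = e; j < n; ++j)
            for (int i = k; i < e; ++i)
                for (int p = k; p < i; ++p)
                    ELEM(a, lda, i, j) -= ELEM(a, lda, i, p) * ELEM(a, lda, p, j);

        /* A21 <- L21 = A21 * inv(U11) (solve X * U11 = A21 column by column) */
        for (int j = k; j < e; ++j)
            for (int i = e; i < n; ++i) {
                for (int p = k; p < j; ++p)
                    ELEM(a, lda, i, j) -= ELEM(a, lda, i, p) * ELEM(a, lda, p, j);
                ELEM(a, lda, i, j) /= ELEM(a, lda, j, j);
            }

        /* A22 <- A22 - L21 * U12 (the bulk of the work, a matrix-matrix update) */
        for (int j = e; j < n; ++j)
            for (int p = k; p < e; ++p)
                for (int i = e; i < n; ++i)
                    ELEM(a, lda, i, j) -= ELEM(a, lda, i, p) * ELEM(a, lda, p, j);
    }
}
```

In contrast to the lazy sketch, every iteration here updates the entire trailing block $A_{22}$, so at any stage as much of (5) as possible has been applied.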

3.4. OTHER ALGORITHMS 

We briefly discuss the three algorithms which may be derived from 
the remaining three loop-invariants. 

Row-lazy algorithm: As a point of reference, Stewart [19] calls this 
algorithm Pickett’s charge south. 

Let us assume that only (2) and (3) have been satisfied. Now it suffices 
to compute $L_{10}$, $L\backslash U_{11}$, and $U_{12}$. Using the same techniques as before 
one derives the algorithm in Fig. 1(c). Again, this algorithm overwrites 
the given n x n matrix A with its LU factorization. 

Column-lazy algorithm: This algorithm is referred to as left-looking 
in [7] while Stewart [19] calls it Pickett’s charge east. 

Let us assume that only (2) and (4) have been satisfied. Now it suffices 
to compute $U_{01}$, $L\backslash U_{11}$, and $L_{21}$. Using the same techniques as before 
one derives the algorithm in Fig. 1(d). Again, this algorithm overwrites 
the given n x n matrix A with its LU factorization. 

Row-column-lazy algorithm: This algorithm is often referred to as 
Crout’s method in the literature. 

Let us assume that only (2), (3), and (4) have been satisfied. This 
time, it suffices to compute $L\backslash U_{11}$, $U_{12}$, and $L_{21}$, yielding the algo- 
rithm in Fig. 1(e). Again, this algorithm overwrites a given n x n matrix 
A with its LU factorization. 
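
Since all five variants are meant to leave the same $L\backslash U$ in place of A, one simple way to exercise an implementation of any of them is to multiply the stored factors back together and compare with the original matrix. The checker below is a sketch under assumptions (column-major storage, hypothetical names `lu_max_residual` and `ELEM`); it is a testing aid, not a substitute for the correctness argument made in this paper.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access (storage convention assumed for this sketch). */
#define ELEM(a, lda, i, j) ((a)[(size_t)(j) * (size_t)(lda) + (size_t)(i)])

/* Largest absolute entry of A - L*U, where lu stores L\U in one n x n
 * array: strictly lower part is L (unit diagonal implied), upper part is U. */
double lu_max_residual(const double *a, const double *lu, int n, int lda)
{
    assert(a != NULL && lu != NULL && n >= 0 && lda >= n);
    double worst = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            int m = (i < j) ? i : j;         /* L(i,p)*U(p,j) vanishes for p > min(i,j) */
            double s = 0.0;
            for (int p = 0; p <= m; ++p) {
                double l = (p == i) ? 1.0 : ELEM(lu, lda, i, p);
                s += l * ELEM(lu, lda, p, j);
            }
            double r = ELEM(a, lda, i, j) - s;
            if (r < 0.0)
                r = -r;
            if (r > worst)
                worst = r;
        }
    return worst;
}
```

In floating point arithmetic the residual will generally be small rather than exactly zero, so a tolerance appropriate to the problem would be needed in practice.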

4. SO MANY ALGORITHMS, SO LITTLE 
TIME 

The primary motivating force behind developing a systematic frame- 
work for deriving algorithms is that, depending on architecture and/or 
matrix dimensions, different algorithms may exhibit different perfor- 
mance characteristics. So, an algorithm which performs admirably on 
one architecture and/or a particular problem size may prove to be an 
inferior algorithm when implemented on another architecture or applied 
to a problem with dissimilar dimensions. 

In [9] we show that the efficient, transportable implementation of 
matrix multiplication on a sequential architecture with a hierarchical 
memory requires a hierarchy of matrix algorithms whose organization, 
in a very real sense, mirrors that of the memory system under considera- 
tion. Perhaps surprisingly, this is necessary even when the problem size 
is fixed. In the same paper we describe a methodology for composing 
these routines. In this way, minimal coding effort is required to attain 
superior performance across a wide spectrum of algorithms and problem 
sizes. Analogously, in [10] we demonstrate that an efficient implementa- 
tion of parallel matrix multiplication requires both multiple algorithms 
and a method for selecting the appropriate algorithm for the presented 
case if one is to handle operands of various sizes and shapes. In [16] we 
came to a similar conclusion in the context of out-of-core factorization 
algorithms. 

Taken together, these concerns and discoveries have motivated us to 
create an enabling technology for automating the entire process. What 
we have developed, in FLAME as it couples with our design philosophy, 
is a major step in that direction. 

5. FLAME: CODING THE ALGORITHM 

In an effort to make the code look like the algorithms given in Fig. 1, 
while simultaneously accounting for the constraints imposed by C and 
Fortran, we have developed FLAME. Rather than going into great detail 
regarding this library of routines, we give the FLAME code for the lazy 
algorithm in Fig. 2. 

The reader will recognize the skeleton which the five variants share, 
with only the code between the lines marked by /* ********* */ differ- 
ing between implementations. The matrix A is passed to the routine as a data 
structure, A, which describes all attributes of this matrix, such as di- 
mensions and method of storage. Inquiry routines like FLA_Obj_length 
are used to extract information such as the row dimension of the ma- 
trix. Finally, ATL, A00, etc. are simply references into the original array 
described by A. 

If one is familiar with the coding conventions used to name the BLAS 
kernels, it is clear that by substituting different updates to submatrices 
between the two marker comments in Fig. 2, one can achieve the different 
variants which implement LU factorization without pivoting. 

    #include "FLAME.h"

    void FLA_LU_nopivot_lazy( FLA_Obj A, int nb_alg )
    {
      FLA_Obj ATL, ATR,    A00, A01, A02,
              ABL, ABR,    A10, A11, A12,
                           A20, A21, A22;
      int b;                               /* current block size */

      FLA_Part_2x2( A, &ATL, /**/ &ATR,
                    /* ************** */
                       &ABL, /**/ &ABR,
               /* with */ 0, /* by */ 0, /* submatrix */ FLA_TL );

      while ( ( b = min( FLA_Obj_length( ABR ), nb_alg ) ) != 0 ){
        FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                            /* ************* */   /* ********************* */
                                                    &A10, /**/ &A11, &A12,
                               ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                 /* with */ b, /* by */ b, /* A11 split from */ FLA_BR );

        /* ********************************************************* */
        /* A10 <- L10 = A10 * inv( U00 ) */
        FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                  FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                  ONE, A00, A10 );
        /* A01 <- U01 = inv( L00 ) * A01 */
        FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                  FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                  ONE, A00, A01 );
        /* A11 <- A11 - L10 * U01 */
        FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                  MINUS_ONE, A10, A01, ONE, A11 );
        /* A11 <- L\U11 using the unblocked algorithm */
        FLA_LU_nopivot_level2( A11 );
        /* ********************************************************* */

        FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                         A10, A11, /**/ A12,
                                /* ************** */  /* ****************** */
                                  &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                      /* with A11 added to submatrix */ FLA_TL );
      }
    }

Figure 2 A skeleton for the C implementation of many of the blocked LU factoriza- 
tion algorithms using FLAME. 


A number of issues related to this style of coding are discussed in 
detail in [11], including: 

Proving the code correct: We argue that the approach we take to 
deriving the algorithm allows the algorithm to be proved correct. We 
claim that since the code closely resembles the algorithm, the proof for 
the algorithm largely carries over to a proof for the correctness of the 
code. 

Fortran: Again using MPI as an inspiration, a Fortran interface is avail- 
able for FLAME. Examples of Fortran code are available on the FLAME 
web page (see “Additional Information” at the end of the paper). 

Pivoting: We show how LU factorization with partial pivoting can be 
elegantly expressed in our algorithmic format. 

Performance: We show that the FLAME approach to coding linear 
algebra algorithms does not hinder high performance. 

6. RELATED WORK 

Advances in software engineering for scientific applications have often 
been led by libraries for dense linear algebra operations. These advances 
include interfaces like the BLAS and libraries like EISPACK [17], LIN- 
PACK [4], and LAPACK [2]. 

More recently, a great deal of work has been devoted to implementing 
the bulk of dense linear algebra algorithms by optimizing only matrix- 
matrix multiplication [14]. This work was based on a careful study of 
memory hierarchies and how best to construct an efficient implementa- 
tion of the entire BLAS library with a minimum of coding. FLAME has 
been used to quickly and reliably duplicate and extend these efforts. 

The true complexity of indexing became painfully obvious with the 
advent of distributed memory architectures and linear algebra libraries 
for these kinds of machines [3]. PLAPACK was developed in response 
to this complexity, and that process led to many of the 
insights in this paper. 

A number of recent efforts explore hierarchical data structures for 
storing matrices [13, 12]. The idea is that by storing matrices by blocks 
rather than row- or column-major ordering, data re-use in caches can be 
enhanced. By combining this with recursive algorithms which exploit 
this data structure, impressive performance improvements have been 
demonstrated. Notice that these more complex data structures for se- 
quential algorithms introduce a complexity similar to that encountered 
when data are distributed to the memories of a distributed-memory ar- 
chitecture. Since PLAPACK effectively addressed that problem for those 
architectures, we have strong evidence that FLAME can be extended to 
accommodate more complex data structures for hierarchical memories. 

7. CONCLUSION 

In this paper, we have illustrated that a more formal approach to 
the design and implementation of matrix algorithms, combined with the 
right level of abstraction for coding, leads to a software architecture 
for linear algebra libraries which is dramatically different than the one 
resulting from the more traditional approaches used by packages such as 
LINPACK, LAPACK, and ScaLAPACK. The approach is such that the 
library developer is forced to give careful attention to the derivation of 
the algorithm. The benefit is that the code is a direct translation of the 
resulting algorithm, greatly reducing opportunities for the introduction 
of common bugs related to indexing. Our experience shows that there is 
no significant loss of performance. Indeed, since more flexible algorithms 
can now be developed we often observe a performance benefit from the 
approach. 

Notice that throughout the paper we concentrate on the correctness 
of the algorithm. As we said earlier in the paper, this is not the same as 
proving that the algorithm is numerically stable. While we do not claim 
that our methodology will automatically generate stable algorithms, we 
do claim that the skeleton used to express the algorithm, and to im- 
plement the code, can be used to implement extant algorithms with 
known numerical properties. It also facilitates the discovery and im- 
plementation of new algorithms for which numerical properties can be 
subsequently established. 

The nature of the design process and the tight coupling between al- 
gorithm and implementation have many advantages. Currently, we are 
pursuing a project wherein we exploit this systematic approach in order 
to automatically generate entire (parallel) linear algebra libraries along 
with run-time estimates for the cost of the generated routines. 

Additional Information 

For more information: http://www.cs.utexas.edu/users/flame/. 

DISCUSSION 

Speaker: Robert van de Geijn 

Mladen Vouk : There is an initial gain in productivity if formal meth- 
ods are used. However, to optimize the result, one needs to do additional 
work. What is the ratio of formal development effort to traditional effort, 
and what fraction of the traditional effort has to be added to optimize 
the “formal” code? 

Robert van de Geijn : Regarding the ratio between development 
time using a traditional approach versus our more formal approach, a 
full implementation of all sequential level-3 BLAS (excluding matrix- 
matrix multiplication) can easily require months of time on the part of 
an experienced programmer, and would require extensive testing before 
correctness were established. Using our approach, one person (myself) 
implemented all these operations in a matter of about 10 hours, including 
testing, and I will give $100 to anyone who finds a bug in the code. 
Regarding the additional optimization, the implementation of the LU 
factorization with pivoting took about an hour. Optimizing the two 
routines that really improved performance, the pivoting routine and the 
level-2 BLAS based LU factorization with pivoting, takes 10-20 lines of 
code, less than 30 minutes of coding. 

Fred Gustavson : A comment: I consider your formal approach to 
be a significant contribution because it captures in a succinct manner 
the underlying linear algebra theory, translating it into C and Fortran 
programs in an automatic way. Additionally, I point out that linear al- 
gebra algorithms can be expressed recursively and that you get different 
algorithms than the ones that your formal approach considers. Finally, 
it is possible to represent a matrix using a different data structure than 
those currently supported by C and/or Fortran. Expressing your formal 
algorithms in terms of these new data representations should lead to 
higher performance on today’s processors. 

Robert van de Geijn : In some sense recursion is covered since im- 
plementation of the smaller subproblems is not explicitly addressed, and 
is typically accomplished through recursive calls. Indeed, recursion was 
used to improve performance. A more direct treatment of this would 
be a nice addition to the theory. As for the new data representations, 
notice that the issues one has to deal with are similar to those encountered 
on distributed memory architectures. Since much of the inspiration for 
FLAME came from our Parallel Linear Algebra Package (PLAPACK), 
we believe our formal approach can very elegantly and effectively address 
more complex data representations. 

Margaret Wright : Developers of existing codes, like LAPACK and 
ScaLAPACK, might argue that users do not need to read the underlying 
library code, and hence addressing readability is not important. 

Robert van de Geijn : This is fine if one treats a library as a black 
box, but often modifications must be made to existing functionality. 

David Walker : Where is parameter and error checking performed in 
your matrix multiply code? 

Robert van de Geijn : In PLAPACK we separate the parameter 
checking into a separate routine, which can be called optionally. In 
FLAME we intend to take the same approach. 

Vladimir Getov : You mentioned that your implementation follow- 
ing the formal methods approach outperforms ScaLAPACK. Could you 
provide some insight as to why your performance results are better? 

Robert van de Geijn : By coding at a higher level of abstraction, 
one can implement more complex algorithms, which attain better per- 
formance. While these same algorithms can sometimes also be imple- 
mented using more traditional coding styles, e.g., using ScaLAPACK, 
the indexing becomes sufficiently complex that the resulting codes are 
frequently not manageable. 




NEW GENERALIZED MATRIX DATA 
STRUCTURES LEAD TO A VARIETY OF 
HIGH-PERFORMANCE ALGORITHMS 

Fred G. Gustavson 

IBM T. J. Watson Research Center 
Yorktown Heights NY, USA 



Abstract We describe new data structures for full and packed storage of dense 
symmetric/triangular arrays that generalize both. Using the new data 
structures one is led to several new algorithms that save “half” the stor- 
age for symmetric matrices and outperform the current blocked based 
level 3 algorithms in LAPACK. We concentrate on the simplest forms 
of the new algorithms and show they are a direct generalization of LIN- 
PACK. This means that level 3 BLAS’s are not required to obtain level 
3 performance. The replacement for Level 3 BLAS are so-called kernel 
routines, see [1], and on IBM platforms they are producible from simple 
textbook type codes, by the XLF Fortran compiler. In the sequel I will 
label these “vanilla” codes. On Power3 with a peak performance of 800 
MFlops, the performance of Cholesky factorization at order n > 200 is over 
720 MFlops and then reaches 735 MFlops at n = 400. Using conven- 
tional full format LAPACK DPOTRF with ESSL BLAS’s one first gets 
to 600 MFlops at n > 600 and only reaches a peak of 620 MFlops. The 
simple algorithm for LU factorization with partial pivoting for this new 
data format is a direct generalization of LINPACK algorithm DGEFA. 
Again, no conventional level 3 BLAS’s are required; the replacements 
are again so-called kernel routines. Programming for square blocked 
full matrix format can be accomplished in standard Fortran through 
the use of three and four dimensional arrays. Thus, no new compiler 
support is necessary. Also we mention that other more complicated 
algorithms are possible; e.g., recursive ones. The recursive algorithms 
are also easily programmed via the use of tables that address where the 
blocks are stored in the two dimensional recursive block array. Finally, 
we describe block hybrid formats. Doing so allows one to use no addi- 
tional storage over conventional (full and packed) matrix storage. This 
means the new algorithms are completely portable. 



Keywords: generalized data structures, BLAS, ESSL, LINPACK, LAPACK 







1. INTRODUCTION 

In [6], [9], Recursive Blocked Data formats were introduced as a re- 
placement for standard Fortran/C array storage. One of the key inno- 
vations was to see that storing a matrix as a collection of submatrices 
(e.g., square blocks of size NB) led to very high performance on today’s 
RISC type processors. Let a rectangular matrix A have M rows and N 
columns. For clarity, assume M = m*NB and N = n*NB. A consists of mn 
submatrices of size NB by NB. In [6], [9] we demonstrated that recursion 
(i.e., divide-and-conquer) should be used to order these mn blocks in 
storage. This storage arrangement leads to L2, L3, and memory block- 
ing automatically. However, the ordering of the blocks is non-linear and 
tables are needed to properly address these mn blocks. 

A simpler way to order the mn blocks is standard Fortran/C order; 
i.e., store the mn blocks either in column major or row major order. If 
one applies a recursive algorithm to this simpler block data layout one 
approximates an automatic blocking for L2, L3, memory; i.e., perfor- 
mance for this layout will approximate recursive blocked data layout, 
[6, 9]. The main benefit of the simpler data layout is that addressing 
of an arbitrary (i,j) element of A(0:M-1,0:N-1), namely a(i,j), can be 
easily handled by a compiler and/or a programmer. Let 0 ≤ i < M and 
0 ≤ j < N and write i = I*NB + i1 and j = J*NB + j1. If A is stored in 
column block major order and all blocks are in column major order then 
a(i,j) is in block I + m*J at location i1 + NB*j1 of that block. Also, 
IBM Power3 processors have 3 fixed point units which are mostly idle 
during floating point computations. Hence the additional index compu- 
tations above can probably be overlapped meaning there will be little or 
no performance loss. 
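
The index computation just described is short enough to write out. The helper below is a sketch, with the hypothetical name `sb_offset`; it assumes, as in the text, that M is a multiple of NB, that the blocks are ordered column-major, and that each block is stored contiguously in column-major order, so that the returned value is an offset from the start of the whole array.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of a(i,j) in square blocked column-major storage.
 * m_blocks is the number of block rows (M = m_blocks * nb). */
size_t sb_offset(int i, int j, int m_blocks, int nb)
{
    assert(nb > 0 && m_blocks > 0 && i >= 0 && j >= 0);
    int I = i / nb, i1 = i % nb;   /* block row, row inside the block */
    int J = j / nb, j1 = j % nb;   /* block column, column inside the block */
    size_t block = (size_t)I + (size_t)m_blocks * (size_t)J;
    return block * (size_t)nb * (size_t)nb   /* start of block I + m*J */
         + (size_t)i1 + (size_t)nb * (size_t)j1;
}
```

For a 4 x 4 matrix with NB = 2 (so m = 2), element (3,2) lies in block 1 + 2*1 = 3 at location 1 + 2*0 = 1 of that block, giving offset 3*4 + 1 = 13.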

For Level 3 algorithms, the basis of ESSL (Engineering and Sci- 
entific Subroutine Library) is a set of kernel routines that achieve peak perfor- 
mance when the underlying arrays fit into L1 cache [1]. If one were to 
adopt these new simpler data formats then BLAS and LAPACK type 
algorithms become almost trivial to write. We will discuss an example 
in the next paragraph, and in Sections 2 and 3 will show two detailed 
examples. However, before closing this paragraph, I want to argue that 
the combination of using the new data formats with kernel routines is 
general and that for matrix factorization it overcomes the current perfor- 
mance problems introduced by having a non uniform memory hierarchy. 
We use the Algorithms and Architecture approach, see [1] to present the 
argument. 

1 Floating point arithmetic can’t be done unless its operands are in 
L1 cache. 

2 Two dimensional Fortran and C arrays do not map nicely into L1 
cache. 

(a) The best case happens when the array is contiguous and prop- 
erly aligned. 

(b) At least a three way associative cache is required when matrix 
multiply is being done. 

3 For peak performance all matrix operands must be used multiple 
times when they enter L1 cache. 

(a) This assures that the cost of bringing an operand into cache is 
amortized by its multiple re-use. 

(b) Multiple re-use of all operands only occurs when all matrix 
operands map well into L1 cache. 

4 Each scalar a(i,j) factorization algorithm has a square submatrix 
counterpart A(I:I+NB-1,J:J+NB-1) algorithm. 

(a) Golub and Van Loan’s “Matrix Computations” book, [5]. 

(b) LAPACK library. 

5 Some square submatrices are both contiguous and fit into L1 cache. 

6 Dense matrix factorization is a level 3 computation. 

(a) Dense matrix factorization, in this context, is a series of sub- 
matrix computations. 

(b) Every submatrix computation (executing any kernel routine) 
is a level 3 computation. 

(c) A level 3 computation is one in which each matrix operand 
gets used multiple times. 

7 Map the input Fortran/C array (matrix A) to a set of contiguous 
submatrices that each fit into L1 cache. 

8 Apply the appropriate submatrix algorithm. 

Points 1 to 3 are architecture facts about many of today’s processors. 
Points 4 to 6 are dense linear algebra algorithm facts. The book [5], 
point 4a, gives a detailed listing of the scalar algorithms and describes 
(with references to the research literature) their block submatrix 
counterparts. The LAPACK library, point 4b, gives code for the submatrix 
counterpart algorithms. However, the block submatrix codes of point 
4 use Fortran and C to input their matrices and so point 5 does not 
hold. See page 739 of [6] for more details. Point 5 does hold for the 
new data structures described here. Assuming both points 5 and 6 hold, 
we see that point 3 holds for every execution of the kernel routines that 
make up the factorization algorithm. This demonstrates that near peak 
performance will be achieved. Points 7 and 8 suggest an obvious algo- 
rithm change that is justified by points 1 to 6. Point 7 is pure overhead 
for the new algorithms. Using the new data formats reduces this cost to 
zero. By only doing point 8 we see that we get near peak performance as 
every subcomputation of point 8 is a point 6b computation. Note that 
each kernel call of the submatrix algorithm is a level 3 call and so, on 
average, every scalar of each submatrix gets used multiple times. Now 
there is one type of kernel routine that deserves special mention. It is the 
factor kernel. Neither LAPACK nor the research literature treats factor 
kernels in any depth. For example, LAPACK factor routines are level 2 
routines; they are named with suffix TF2, and they call level 2 BLAS 
routines repetitively. On the other hand ESSL [1] and, more recently, 
[3, 6, 9], where recursion is used, have produced level 3 factor kernels. 
In the above argument we did not mention L2, L3 cache, memory, out-of-core. 
So, we argue that by storing the set of contiguous matrices recursively 
and by using a related recursive algorithm one automatically 
blocks for all these higher levels of the memory hierarchy. 

Take any vanilla code, say Gaussian elimination with partial pivot- 
ing or QR factorization of an M by N matrix A. This code has a block 
equivalent where the stride one distance is NB^2. The row stride for the 
vanilla code is LDA. For the block equivalent it is m*NB^2. It is quite 
easy to write the block equivalent code from the vanilla code. In the 
vanilla code a floating point operation is usually an FMA (c = c + ab) 
which in the block equivalent is a call to a DGEMM kernel. Similar 
analogies exist; e.g., for b = b/a or b = b*a we have either a DTRSM 
or a DTRMM kernel. In the simple block equivalent codes we are led 
to one of the variants of IJK order [2]. For these types of algorithms 
the BLAS are simply calls to kernel routines. In reality, the BLAS 
disappear entirely. However, more complicated algorithms can be em- 
ployed; e.g., recursive algorithms. In that case, a BLAS call accesses 
several submatrix blocks [9, 6]. So, like a traditional BLAS call, a 
blocking routine is required to call the kernel routines multiple times on 
different submatrix operands as the blocking is now variable. However, 
unlike a traditional BLAS routine no data copying is performed. This 
means performance improves. 
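
The correspondence between a scalar FMA and a kernel call can be illustrated with matrix multiplication on square blocked storage. The sketch below is an assumption-laden illustration, not ESSL code: `gemm_kernel` is a plain triple loop standing in for a tuned DGEMM kernel, `sb_gemm` is a hypothetical driver, and all three operands are assumed to be stored as contiguous NB x NB column-major blocks ordered column-major.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a tuned kernel: c += a * b on contiguous nb x nb
 * column-major blocks. */
static void gemm_kernel(int nb, const double *a, const double *b, double *c)
{
    for (int j = 0; j < nb; ++j)
        for (int p = 0; p < nb; ++p)
            for (int i = 0; i < nb; ++i)
                c[i + nb * j] += a[i + nb * p] * b[p + nb * j];
}

/* C += A * B where A has m x k blocks, B has k x n blocks and C has
 * m x n blocks. The loop nest is the vanilla scalar code with each
 * scalar FMA replaced by one kernel call on whole blocks. */
void sb_gemm(int m, int n, int k, int nb,
             const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    assert(m >= 0 && n >= 0 && k >= 0 && nb > 0);
    size_t bs = (size_t)nb * (size_t)nb;     /* elements per block: NB^2 */
    for (int J = 0; J < n; ++J)
        for (int P = 0; P < k; ++P)
            for (int I = 0; I < m; ++I)
                gemm_kernel(nb,
                            a + ((size_t)I + (size_t)m * P) * bs,
                            b + ((size_t)P + (size_t)k * J) * bs,
                            c + ((size_t)I + (size_t)m * J) * bs);
}
```

No operand is copied before the kernel is called, since every block is already contiguous; the block strides NB^2 and m*NB^2 mentioned above appear here as `bs` and `m * bs`.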

In the above paragraph, I indicated that new algorithms can be ob- 
tained from simple Dense Linear Algebra vanilla codes if one first in- 
troduces the new data formats. One way to do this is to ask the user 
to input his data in standard Fortran or C order with some additional 
storage appended below his standard array. Then the standard Fortran 
or C array could be transformed in-place to the block square format. 
Next the block equivalent of the vanilla code (with calls to standard 
ESSL type kernels) could be performed on the transformed square block 
column format. The performance should be superb. For example, on a 
200 MHz IBM Power3 with a peak performance of 800 MFlops, the per- 
formance of Cholesky factorization at order n > 200 is over 720 MFlops 
and then reaches 735 MFlops at n = 500. We did not include the cost 
of transforming the data to square blocked packed format. Using con- 
ventional full format LAPACK DPOTRF with ESSL BLAS’s one first 
gets to 600 MFlops at n > 600 and only reaches a peak of 620 MFlops. 
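
The data transformation mentioned above can be sketched as follows. Note the assumption: for simplicity this version copies into a separate destination array, whereas the text describes an in-place transformation using a small amount of appended storage; the name `colmajor_to_sb` is hypothetical, and the matrix dimensions are taken to be exact multiples of NB.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a column-major matrix (leading dimension lda) with m x n blocks
 * of size nb x nb into square blocked column-major storage in dst.
 * dst must hold m*n*nb*nb doubles and must not overlap src. */
void colmajor_to_sb(const double *src, int lda, int m, int n, int nb,
                    double *dst)
{
    assert(src != NULL && dst != NULL);
    assert(nb > 0 && m >= 0 && n >= 0 && lda >= m * nb);
    size_t bs = (size_t)nb * (size_t)nb;
    for (int J = 0; J < n; ++J)
        for (int I = 0; I < m; ++I) {
            double *blk = dst + ((size_t)I + (size_t)m * J) * bs;
            for (int j1 = 0; j1 < nb; ++j1)
                for (int i1 = 0; i1 < nb; ++i1)
                    blk[i1 + (size_t)nb * j1] =
                        src[(size_t)(I * nb + i1)
                            + (size_t)(J * nb + j1) * (size_t)lda];
        }
}
```

As with the timings quoted above, any performance comparison would have to state whether the cost of this conversion is included.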

There is great resistance to changing data formats that have existed a 
long time. This is especially so for major programming languages such 
as Fortran and C; here we are considering the storage layout for a two di- 
mensional array. Also, libraries for dense linear algebra; e.g., LAPACK, 
LINPACK, and ESSL all support packed format symmetric/triangular 
arrays. A compelling reason is portability: If one changes the input 
layout to a library subroutine/function then existing software that calls 
that library subroutine/function will not be operational. So, we intro- 
duce modifications to our new matrix data formats which we call BHF 
for block hybrid formats. We claim that an isosceles trapezoid is an 
appropriately general shape to describe the uni-processor algorithms of 
dense linear algebra. This trapezoid consists of an isosceles right triangle 
and a rectangle. Hence, we store the triangle part of the trapezoid in 
packed format and the rectangle part as a general matrix (full format). 
Using BHF solves the portability problem in most cases. 

Now I turn briefly to SMP algorithms for Dense Linear Algebra Codes. 
In [3], superb SMP performance was obtained using recursive blocking 
on matrices stored in standard Fortran order. The new square blocked 
full order storage could be used instead of standard Fortran order. In 
that case, performance should improve even further. Similar statements 
can be made for other dense linear algebra algorithms; e.g., Cholesky 
and general LU factorization. 

In Section 2 we describe both simple square blocked full formats and 
simple square blocked full hybrid formats for dense matrix arrays. We 
describe an algorithm for Gaussian Elimination with partial pivoting 
for the first type of storage. We give performance results on an IBM 
Power 3 processor for the hybrid format and compare them with the 
LAPACK DGETRF algorithm. Recently, other new block data formats, 
[8] have been discovered while working on this project, both at Uni-C 
and IBM Research. These have to do with triangular/symmetric arrays. 
Figure 1 Standard (left) and New Square Blocked (right) column major storage order



Again the major difference between these new array formats is they 
more closely mirror standard storage formats than the recursive arrays 
which store the blocks in a non-linear fashion. To be precise, these 
symmetric/triangular arrays store the blocks either as block columns 
or in standard packed format order. In Section 3 we describe square 
blocked packed formats for Symmetric/Triangular arrays and show they 
generalize both the standard packed and full arrays used by dense linear 
algebra algorithms. We also describe a hybrid version of the new data
formats that allows our new algorithms to become fully portable. For
each type of new storage format we describe an associated Cholesky
factorization algorithm. Also, performance results on an IBM Power
3 processor for both algorithms versus LAPACK’s DPOTRF/DPPTRF are
presented. In both Sections 2 and 3 we give brief notes on how to
program for these new formats. Equally brief are Sections 4 to 6 (on
Vectors, Recursion, and BLAS) which describe how the new formats
affect/are affected by these three subjects. Due to lack of space we
regret not covering these topics in any detail. In Section 7 we briefly 
describe what kernel routines are. In Section 8 we give a brief Summary 
and Conclusions. 

2. SIMPLER SQUARE BLOCKED FULL 
FORMATS FOR MATRICES 

These new data formats are best described by an example. Let A be
an M = 11 by N = 10 matrix with LDA = 12. In Fortran this matrix would
be stored as shown in Figure 1 (left); the number in location (i,j) is
the storage position of a(i,j) in A.

Let NB = 4 be the block size and suppose A is stored in column major
block order. Here m1 = n1 = 3 and A is a 3 by 3 block matrix. Each
square block is NB by NB and contains a submatrix of A. In the new
data format A would be stored as in Figure 1 (right).

Now suppose a user inputs his matrix A in standard Fortran or C
order with some additional space in the array holding A directly below
A. For example in Figure 1 (left) the minimum storage for A is LDA*n =
120 double words. If the user supplied ns ≥ 144 elements (extra storage
of ≥ 24 double words directly below A) one could transform Fortran
storage order to the new square blocked full column order. Once this
data transformation is completed one could execute the block equivalent
of a standard vanilla code with calls made to standard kernel routines.
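
The index mapping behind this transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name to_square_blocked is hypothetical, the copy is done out of place rather than in place as the text proposes, blocks are ordered column by column (block (I,J) at position I + m1*J), each block is column major, and edge blocks are padded with zeros.

```python
def to_square_blocked(a, m, n, lda, nb):
    """Copy an m-by-n column-major array (leading dimension lda) into
    square blocked column-major order, padding edge blocks with zeros.
    Out-of-place sketch; the text proposes an in-place transformation."""
    m1 = (m + nb - 1) // nb          # number of block rows
    n1 = (n + nb - 1) // nb          # number of block columns
    out = [0.0] * (m1 * n1 * nb * nb)
    for j in range(n):
        for i in range(m):
            bi, li = divmod(i, nb)   # block row, local row
            bj, lj = divmod(j, nb)   # block column, local column
            block = bi + m1 * bj     # blocks are ordered column by column
            out[block * nb * nb + li + nb * lj] = a[i + lda * j]
    return out, m1, n1


# The example of the text: M = 11, N = 10, LDA = 12, NB = 4.
m, n, lda, nb = 11, 10, 12, 4
a = [float(p) for p in range(lda * n)]   # entry value = storage position
blocked, m1, n1 = to_square_blocked(a, m, n, lda, nb)
```

For the example above the blocked array needs m1*n1*NB^2 = 144 elements, which is where the extra 24 double words come from.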

2.1. SQUARE BLOCKED FULL DATA
FORMAT GAUSSIAN ELIMINATION

We briefly describe this procedure for a right looking Gaussian
elimination with partial pivoting. In the vanilla version (NB = 1 here) the
outer loop is on j = 0, n-1 and for each j one finds the pivot in
column j and swaps it with a(j,j). Then a(j+1:m-1,j) is scaled by the
reciprocal of the pivot to form column j of L. Next, cols k = j+1, n-1
are processed in two steps. Let k be a generic column. First, a swap
of row j and the pivot row is made. Secondly, a DAXPY update is
performed: a(j+1:m-1,k) = a(j+1:m-1,k) - a(j+1:m-1,j)*a(j,k). For
the block version refer to Figure 4. In the blocked version the outer loop
is on bj = 0, n1-1 and for each bj one factors a block column L*U =
P*A(j:m-1,j:n-1) by calling kernel routine RGETF3. Then cols j+nb to
n-1 are processed in three steps. Let k:k+ks-1 be the generic block col.
First, there is a forward pivot step. Next, there is a DTRSM
computation whose first four parms are ’L’, ’L’, ’N’, ’U’, done by kernel routine
DLLNU4. Finally, there is a DGEMM update whose first two parms are
’N’, ’N’, which is done by a series of calls to kernel routine DAB4. After
these three steps there is a back pivot step. As just mentioned there are
three kernel routines in the block equivalent (see Figure 4). They are
a kernel called RGETF3 that factors a panel of size m by n where
n ≤ NB, a DTRSM kernel called DLLNU4, and a DGEMM kernel called DAB4. We
mention that the factor kernel has the same function as LAPACK
routine DGETF2. However, it is a level 3 routine, done recursively, as the
prefix R and suffix 3 indicate. Note that the vanilla routine, actually
the Linpack routine DGEFA, does not have a back pivot step. So the
blocked version does extra work which actually could be avoided.
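
The vanilla (NB = 1) procedure described above can be sketched as follows. This is a minimal Python illustration, not the Linpack DGEFA source; the name vanilla_lu and the list-of-rows storage are assumptions made for readability. As in the text, there is no back pivot step, so earlier columns of L are not interchanged.

```python
def vanilla_lu(a):
    """Right-looking Gaussian elimination with partial pivoting (NB = 1).
    a is a list of rows and is overwritten with the multipliers of L
    (below the diagonal) and with U. Returns the pivot row of each step."""
    m, n = len(a), len(a[0])
    piv = []
    for j in range(min(m, n)):
        # find the pivot in column j
        p = max(range(j, m), key=lambda r: abs(a[r][j]))
        piv.append(p)
        if a[p][j] == 0.0:
            continue                        # zero column: nothing to eliminate
        a[j][j], a[p][j] = a[p][j], a[j][j]  # swap the pivot with a(j,j)
        # scale a(j+1:m-1,j) by the reciprocal of the pivot
        r = 1.0 / a[j][j]
        for i in range(j + 1, m):
            a[i][j] *= r
        for k in range(j + 1, n):
            # step 1: swap row j and the pivot row in column k
            a[j][k], a[p][k] = a[p][k], a[j][k]
            # step 2: DAXPY update of a(j+1:m-1,k)
            for i in range(j + 1, m):
                a[i][k] -= a[i][j] * a[j][k]
    return piv
```

The blocked version follows the same pattern, with each scalar step replaced by a kernel call on NB by NB blocks.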
Figure 2 Full Hybrid Block column major storage order



Now I turn to how DGEFB might be packaged as a subroutine in
standard Fortran. Following LAPACK, we suggest using the input format of
DGETRF(M,N,A,LDA,IPIV,INFO). The new routine would have a nearly
identical calling sequence: DBGTRF(M,N,A,LDA,IPIV,NSINFO). The new
input parameter is NS ≥ n*LDA and it is combined with the LAPACK output
only parameter INFO; hence, the name NSINFO. If NSINFO is not
sufficiently large, DBGTRF just returns by placing in -NSINFO the amount of
storage necessary to apply the new block algorithm. If NSINFO is
sufficiently large, the input storage is rearranged into square block format
and then DBGTRF is executed using the new block algorithm. Like the
LAPACK LWORK parm, the value returned in -NSINFO will be the value
used by DBGTRF for good level 3 performance, namely m1*n1*NB^2.
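
The NSINFO convention can be illustrated with a small Python sketch. The helper names dbgtrf_storage and check_nsinfo are hypothetical, and NB is passed explicitly here even though the proposed DBGTRF calling sequence does not expose it; the sketch only mirrors the storage test described above.

```python
def dbgtrf_storage(m, n, nb):
    """Storage, in elements, of the square blocked full format: m1*n1*NB^2."""
    m1 = (m + nb - 1) // nb
    n1 = (n + nb - 1) // nb
    return m1 * n1 * nb * nb


def check_nsinfo(m, n, nb, nsinfo):
    """Mimic the proposed convention: return 0 when the supplied storage
    suffices, otherwise minus the amount of storage that is required."""
    needed = dbgtrf_storage(m, n, nb)
    return 0 if nsinfo >= needed else -needed
```

With the example of Figure 1 (m = 11, n = 10, NB = 4), a caller supplying only LDA*n = 120 elements would get back -144 and could retry with a larger array.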

2.2. SQUARE BLOCKED FULL HYBRID 
DATA FORMATS FOR MATRICES 

Let A be m by n where LDA ≥ m; i.e., A is stored in column major
order. Assume m ≥ n. A similar result holds for n > m. Let n1 =
⌈n/NB⌉ and n2 = n + NB - n1*NB. Partition the column space of A into
n1 - 1 pieces of size NB and a leftover piece of size n2 ≤ NB. This new
format represents A as a set of n1 - 1 rectangles of size m*NB and a last
one of size m*n2. We partition the row space of A into m1 - 1 pieces
of size NB and a leftover piece of size m2 ≤ NB. Here m1 = ⌈m/NB⌉ and
m2 = m + NB - m1*NB. Matrix A is an m1 by n1 block matrix. Each
block has a TRANS parm; i.e., it can be stored in column or row major
order. In each of the n1 rectangles the last LDA - m rows are stored last.
For the original A we assume LDA ≥ m. In Figure 2 we give matrix A
associated with Figure 1.
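
As a quick check of the partition formulas, the following Python sketch computes the number of pieces and the size of the leftover piece. The function name partition is an assumption for illustration only.

```python
def partition(n, nb):
    """Return (n1, n2): the number of pieces and the size of the last one.
    n1 = ceil(n/NB) and n2 = n + NB - n1*NB, so that 1 <= n2 <= NB."""
    n1 = -(-n // nb)            # integer ceiling of n/NB
    n2 = n + nb - n1 * nb
    return n1, n2
```

For the matrix of Figure 1 (m = 11, n = 10, NB = 4) this gives three column pieces with a leftover of size 2 and three row pieces with a leftover of size 3; when NB divides n the leftover piece has full size NB.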
2.2.1 A Matching BLAS 3 LU = PA Algorithm. It is
almost as easy to code this algorithm as the DGEFB algorithm of Figure 4.
Due to lack of space we do not include the code for this format.

2.3. PROGRAMMING NOTES FOR SQUARE 
BLOCKED FULL FORMATS 

As mentioned in the introduction one addresses the block coordinates
(I, J) and the local coordinates (i,j) within that block. So, by using
three or four dimensional Fortran / C arrays, one can program for these
formats. In Figure 4, we use a three dimensional array; we handle the
block (I, J) addressing implicitly as 1-D addressing in the third
dimension. See Section 5 ahead on how to handle blocks that are stored
recursively. The main programming difficulties arise in coding the factor
kernel routines. For example, Gaussian Elimination with partial pivoting
requires working with a column of blocks. Thus local offsets in a square
block are required; see Section 4 ahead. Because of this, the number
of parameters, and generalization to algorithms for distributed memory
processors, I predict that descriptors will eventually be used. In [7], I
described some preliminary descriptor formats.

2.4. PERFORMANCE FOR LU = PA 

We have tried several variants of solving LU = PA; e.g., left and right
looking and recursive variants. Also, we have tried several variants of the
new data structures. Here we show performance results on a 200 MHz
IBM Power 3 processor with a peak MFlop rate of 800. Results are
for Full Hybrid Block format. We do not do the data transformation
immediately. The reason is that Fortran storage (column major order)
is ideally suited for the factor part of the algorithm. After factorization
it pays to do a data transformation. Not included here are the results
of algorithm DGEFB. We remark that these results are similar. There are
two plots. Both compare the block Linpack algorithm with LAPACK
algorithm DGETRF. Note that the x-axis is log scale; we let n, for
square matrices, and m, for m by 100 matrices, range from 10 to 2000.
For square matrices (Figure 3, left) the new code is 90 % to 10 % faster
than DGETRF as n ranges from 10 to 2000. Note that for very large n
these codes are dominated by ESSL’s DGEMM code. The second graph
(Figure 3, right) is for tall thin matrices; i.e., n is held fixed at
size 100. Here, the new code is nearly twice as fast as DGETRF.
Figure 3 Performance for LU = PA



3. SQUARE BLOCKED PACKED FORMATS 
FOR SYMMETRIC/TRIANGULAR 
ARRAYS 

This new format is a generalization of packed format for triangular 
arrays. It is also a generalization of full format for triangular arrays. The 
main benefit of the new formats is that they allow for level 3 performance 
while using about half the storage of the full array cases. In packed 
format, the elements of a triangular matrix are laid out in memory as
follows (see Figure 5); the numbers represent the position within a
data array.

For square blocked packed formats there are two parameters NB and
TRANS where usually n > NB. For this format, we first choose a block
size NB and then we lay out the data in squares of size NB. Each square
block can be in row major order (TRANS = ’T’) or column major order
(TRANS = ’N’). This format supports both uplo = ’U’ and ’L’. For uplo =
’L’, the first vertical stripe is n by NB and it consists of n1 square blocks
where n1 = ⌈n/NB⌉. It holds the first trapezoidal n by NB part of L. The
next stripe has n1 - 1 square blocks and it holds the next trapezoidal
n - NB by NB part of L, etc., until the last stripe consisting of the last
leftover triangle is reached. There are n1(n1 + 1)/2 square blocks in
all. An example of Square Blocked Lower Packed Format (with TRANS
= ’T’) is given in Figure 6 (left). Here n = 10, NB = 4 and TRANS =
’T’ and the numbers represent the position within the array where a(i,j)
is stored. Note the missing numbers (e.g., 2, 3, 4, 7, 8, and 12 which
correspond to the upper right corner of the first stripe). This square
blocked lower packed array consists of 6 square block arrays: the first
three are 4 by 4, 4 by 4, and 2 by 4. The next two are 4 by 4 and 2 by 4.
The last square block is 2 by 2. Note the padding, which is done for ease
      subroutine dgefb(m,n,a,nb,ipiv,info)
      implicit none
      integer*4 m,n,nb,info
      real*8 a(0:nb-1,0:nb-1,0:*)
      integer*4 ipiv(0:1,0:*)
      real*8 one,zero,t
      parameter (one=1.d0,zero=0.d0)
      integer*4 nb2,m1,n1,m2,n2,j,k,i,bj,bk,bi,ms,ks,ns
      integer*4 iti,ibi,iai,ici ! pointers
      info=0
      nb2=nb*nb           ! size of a square block
      m1=(m+nb-1)/nb      ! row order of block matrix
      n1=(n+nb-1)/nb      ! col order of block matrix
      m2=m+nb-m1*nb       ! row size of last block
      n2=n+nb-n1*nb       ! col size of last block
      bj=0                ! block j index
      do j=0,n-1,nb
*        factor column swath A(j:m-1,j:j+nb-1)
         bk=bj
         iti=bj+m1*bj     ! (0,0)-th element of block iti is A(j,j)
         ns=nb
         if(bj.eq.n1-1)ns=n2
         call rgetf3(m-j,ns,a(0,0,iti),nb,ipiv(0,j),info)
*        globalize pivot indices
         do i=j,j+ns-1
            ipiv(1,i)=ipiv(1,i)+bj
         enddo
         do k=j+nb,n-1,nb
            bk=bk+1       ! block k index
            ks=nb
            if(bk.eq.n1-1)ks=n2
            ibi=bj+m1*bk  ! (0,0)-th element of block ibi is A(j,k)
*           forward pivot A(j:j+m-1,k:k+ks-1) ! a(0,0,bk*m1) -> A(0,k)
            call dlaswpb(ks,a(0,0,bk*m1),nb,j,j+ns-1,ipiv,1)
*           solve L*X = A(j:j+nb-1,k:k+ks-1)
            call dllnu4(nb,ks,a(0,0,iti),nb,a(0,0,ibi),nb)
            bi=bj
            do i=j+nb,m-1,nb
               bi=bi+1
               ms=nb
               if(bi.eq.m1-1)ms=m2
               iai=bi+m1*bj ! (0,0)-th element of block iai is A(i,j)
               ici=bi+m1*bk ! (0,0)-th element of block ici is A(i,k)
*              update A(i:i+ms-1,k:k+ks-1) = A(i:i+ms-1,k:k+ks-1)
*                 - A(i:i+ms-1,j:j+nb-1)*A(j:j+nb-1,k:k+ks-1)
               call dab4(ms,ks,nb,a(0,0,iai),nb,a(0,0,ibi),nb,
     &                   a(0,0,ici),nb)
            enddo
         enddo
         bi=0
         do i=0,bj-1
*           back pivot A(j:j+m-1,i:i+nb-1) ! a(0,0,bi) -> A(0,i*nb)
            call dlaswpb(nb,a(0,0,bi),nb,j,j+ns-1,ipiv,1)
            bi=bi+m1
         enddo
         bj=bj+1          ! next block j col
      enddo
      return
      end



Figure 4 DGEFB subroutine. 
Figure 6 Square Blocked Lower (left) and Upper (right) Packed Format



of addressing. Addressing this set of six square blocks as a composite
block array is straightforward. An example of Square Blocked Upper
Packed Format (TRANS = ’N’) is given in Figure 6 (right). The blocked
upper packed array consists of 6 square block arrays: the first is 4 by
4. The next two are 4 by 4. The last three are 4 by 2, 4 by 2, and 2
by 2. Each block is in column major order. Note the padding, which is
done for ease of addressing. Addressing this set of six square blocks as
a composite block array is straightforward.
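
The addressing of the lower (TRANS = ’T’) case can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: the name sbp_lower_pos is hypothetical, the stripes are assumed to be stored one after another, every block is assumed to be padded to a full NB by NB square, and blocks are row major.

```python
def sbp_lower_pos(i, j, n, nb):
    """1-based storage position of a(i,j), i >= j (1-based indices), in
    square blocked lower packed format with row major (TRANS = 'T') blocks."""
    n1 = -(-n // nb)                       # number of block rows/columns
    bi, li = divmod(i - 1, nb)             # block row, local row
    bj, lj = divmod(j - 1, nb)             # block column (stripe), local column
    # stripes 0..bj-1 hold n1, n1-1, ... blocks; then skip bi-bj blocks
    blocks_before = bj * n1 - bj * (bj - 1) // 2 + (bi - bj)
    return blocks_before * nb * nb + li * nb + lj + 1


# Positions occupied by the lower triangle for n = 10, NB = 4.
used = {sbp_lower_pos(i, j, 10, 4)
        for i in range(1, 11) for j in range(1, i + 1)}
```

Under these assumptions the 55 elements of the triangle occupy 55 distinct positions inside the 6*16 = 96 element array, and positions 2, 3, 4, 7, 8 and 12 are left unused, in agreement with the missing numbers noted for the first stripe.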

Here is another important point. With extra storage appended
directly below a standard packed array one can move to these new data
formats without any further storage. For the examples above, AP requires 55
storage elements. If there are 96 - 55 = 41 free locations below AP then
one can move the packed array downward into the block packed array by
starting at the end of AP and moving the square blocks in a block
column into a buffer of size n1*NB^2 either in row major or column order.
The entire buffer is then copied back over the vacated block column.
Figure 7 Square blocked packed formats when NB = n



I believe a main innovation in using the square blocked packed format 
is to see that one can translate verbatim a standard packed factorization 
algorithm into a square blocked packed algorithm by replacing each ref- 
erence to an i,j element by its corresponding square block submatrix. 
Because of the storage layout, the beginning of each block is easily lo- 
cated. Also key is that this format supports level 3 BLAS. Hence old 
packed code is easily converted into square blocked packed level 3 code. 
In a nutshell, I am keeping “standard packed” addressing so the library
writer/user can handle his own addressing in a Fortran / C environment. 

Now turn to full format. We continue the example with N = 10, and 
LDA = 12. Just set NB = LDA = 12 and one obtains full format; i.e., 
square block packed format gives a single block triangle which happens 
to be full format (see Figure 7). 

In Figure 7 we ignore the last NB — n columns of the square blocked 
array. Here is an interesting observation. The unused storage of size 
n*(LDA + LDA - n - 1)/2 consists of n fragmented vectors of sizes LDA - n :
LDA - 1 : 1. I use colon notation, see [5]. These vectors are interspersed
with 1 : n : 1 vectors of the symmetric matrix A. For uplo = ’U’ the
symmetric matrix consists of ten vectors of sizes 1 to 10 in steps of 1 (55
elements total). The unused storage consists of ten vectors of sizes 11 to
2 in steps of -1 (65 elements total). It is my opinion that users of dense
linear algebra codes do not utilize this fragmented storage. If this is so, 
one could convert full format to square blocked packed format thereby 
freeing up a contiguous block of storage. 
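
The storage counts quoted above are easy to verify. The short Python sketch below, whose function name is an assumption, computes the used and unused parts of a full format triangular array and the sizes of the unused fragments for uplo = ’U’.

```python
def full_format_usage(n, lda):
    """Return (used, unused, fragments) for an order n triangle stored in
    full format with leading dimension lda (uplo = 'U')."""
    used = n * (n + 1) // 2
    unused = n * (lda + lda - n - 1) // 2
    # column j holds j matrix elements, so lda - j locations are unused
    fragments = [lda - j for j in range(1, n + 1)]
    return used, unused, fragments


used, unused, fragments = full_format_usage(10, 12)
```

For n = 10 and LDA = 12 the two counts add up to LDA*n = 120, and the fragment sizes run from 11 down to 2.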
3.1. A CHOLESKY ALGORITHM FOR
SQUARE BLOCK PACKED FORMAT

Now I turn to programming dense linear algorithms in the new
formats. As an example, Figure 8 gives uplo = ’L’ code for DPSTRF (PS stands
for positive definite symmetric) which produces the lower Cholesky
factor for positive definite symmetric A where A is in square blocked packed
lower transposed format. Algorithm DPSTRF is a simple right looking
algorithm as the code illustrates. One could let NB = 1 and then each level
3 square block kernel call becomes a corresponding scalar operation on
an i,j element. For NB = 1 this routine is a variant of Linpack
routine DPPFA. Routine DPOFU4 is an ESSL factor kernel. The corresponding
scalar operation is square root. In ESSL, all level 3 BLAS’s and
factorization routines use kernel routines. For example, in ESSL’s DGEMM, a
blocking routine is called to partition the matrix operands A and B into
submatrices (matrix blocks) and then calls are made to kernel routines
that operate on the blocks. Data copying of the operands to the kernel
routines is decided on by the ESSL DGEMM blocking routine. Now in
DPSTRF we can call the kernel routine DATB4 directly, thereby avoiding
copying, since NB was chosen for good L1 cache behavior. The routine
DSLVL4 is a DTRSM kernel routine and routine DTATA4 is a DSYRK kernel
routine. The suffix 4 on each kernel routine indicates that 4 by 4 register
blocking (loop unrolling by four) is being used. These kernels have no
clean-up code. So, when the order of A is not a multiple of 4 we pad
the leftover blocks (up to 3 rows and columns with zeroes and the up to
3 diagonals with ones). Thus the code works for any matrix order, and
the kernel routines are short and of very high performance. We mention
that the kernel routines are programmed in Fortran. Note that routine
DPSTRF is just one example of the general schema; see Points 7 and 8 of
the Introduction. In Figure 10 (left) we give the performance of DPSTRF
versus LAPACK DPOTRF.

Before closing I want to suggest a way to package this data storage
in LAPACK. I continue with the current routine for Cholesky
factorization, uplo = ’L’. For uplo = ’U’ a similar procedure would be followed.
Define a new LAPACK routine called subroutine DPSTRF(UPLO,N,AP,
NSINFO). Input parameters UPLO, N, AP have the same meaning as the
corresponding parameters of LAPACK routine DPPTRF and hence do not
need any description. The new input parm NS stands for the storage the
user inputs for AP; NS ≥ n(n + 1)/2. Hence, the name NSINFO. On
output NSINFO plays the role of the LAPACK output only parm INFO. If NS
is not sufficiently large, DPSTRF just returns in -NSINFO the amount of
storage necessary for good level three performance. Like the LWORK
! Square Blocked Lower Packed Transposed Format Cholesky Factor !
      subroutine dpstrf(uplo,n,nb,a,info)
      implicit none
      character*1 uplo    ! only uplo = 'L' and trans = 'T' is handled
      integer*4 n,nb,info ! mod(n,4) = 0 is assumed
      real*8 one,a(*)
      integer*4 j,k,i,kb,ib,jj,jk,kk,ji,ki,nb2,n1,n2,nn
      parameter (one=1.d0)
      info=0
      nb2=nb*nb           ! size of a square block
      n2=(n+nb-1)/nb      ! order of block matrix
      n1=n2*nb2           ! block lda
      jj=1                ! -> a(j,j), j=1
      do j=1,n-nb,nb
*        factor a(j:j+nb-1,j:j+nb-1)
         call dpofu4(a(jj),nb,nb,info) ! factor kernel
         if(info.gt.0)goto 30
         ji=jj
         do i=j+nb,n,nb
            ji=ji+nb2
            ib=min(n-i+1,nb)
*           solve a(j:j+nb-1,j:j+nb-1)**T*u(j:j+nb-1,i:i+ib-1) =
*           a(j:j+nb-1,i:i+ib-1) for u
            call dslvl4(a(ji),nb,ib,a(jj),nb,nb) ! trsm kernel
         enddo
*        initialize pointers for the k loop
         kk=jj            ! -> a(k,k), k=j
         nn=n1            ! lda of k block, k=j
         jk=jj            ! -> a(j,k), k=j
         do k=j+nb,n,nb
*           update pointers for the k loop
            jk=jk+nb2     ! -> a(j,k)
            kk=kk+nn      ! -> a(k,k)
            kb=min(n-k+1,nb)
*           update a(k:k+kb-1,k:k+kb-1) = a(k:k+kb-1,k:k+kb-1)
*           - a(j:j+nb-1,k:k+kb-1)**T*a(j:j+nb-1,k:k+kb-1)
            call dtata4(kb,nb,a(jk),nb,a(kk),nb) ! syrk kernel
            ji=jk         ! -> a(j,i), i=k
            ki=kk         ! -> a(k,i), i=k
            do i=k+nb,n,nb
               ji=ji+nb2  ! -> a(j,i)
               ki=ki+nb2  ! -> a(k,i)
               ib=min(n-i+1,nb) ! block a(k,i) has size nb by ib
*              update a(k:k+kb-1,i:i+ib-1) = a(k:k+kb-1,i:i+ib-1)
*              - a(j:j+nb-1,k:k+kb-1)**T*a(j:j+nb-1,i:i+ib-1)
               call datb4(kb,ib,nb,a(jk),nb,a(ji),nb,a(ki),nb) ! gemm kernel
            enddo
            nn=nn-nb2     ! lda for next time through loop
         enddo
*        update pointers for the j loop
         jj=jj+n1         ! -> a(j,j), j=j+nb
         n1=n1-nb2        ! lda for next j block
      enddo
*     factor a(j:n,j:n)
      call dpofu4(a(jj),nb,n-j+1,info) ! factor kernel
      if(info.gt.0)goto 30
      return
   30 info=info+j-1
      return
      end



Figure 8 DPSTRF subroutine. 
Figure 9 Blocked Hybrid Lower Packed Format



parm, the value returned in -NSINFO will be the value used by DPSTRF 
for good level 3 performance. If NS is sufficiently large, the input storage 
is rearranged into square block format and then DPSTRF is executed 
giving level three performance. 

3.2. BLOCK HYBRID FORMATS OF
SYMMETRIC/TRIANGULAR ARRAYS

This new format is a combination of traditional packed and full arrays.
It retains the main benefit of the new formats: it allows for level 3
performance while using exactly the same storage as the packed routines
do. Thus, no extra storage is required and so it is possible to obtain full
portability with the existing packed routines. Its parameters are NB and
TRANS. In block hybrid format (BHF), we first choose a block size
NB and then we lay out the data in trapezoidal swaths. For both uplo
= ’U’ and ’L’ there is also parameter TRANS. Now, for an example, see
Figure 9 where n = 10, NB = 4, uplo = ’L’ and TRANS = ’T’. In general,
the first trapezoidal swath has base n2 and sides n and n - n2 + 1.
It consists of a packed triangle of size n2 and a rectangle consisting of
n1 - 1 blocks where n1 = ⌈n/NB⌉ and n2 = n + NB - n1*NB. The
remaining n1 - 1 trapezoidal swaths have full base width of size NB.
Now each base h by sides (b, b - h + 1) trapezoid consists of a packed
triangle of size nt = h(h + 1)/2 and an appended rectangle of size b - h
by h. The trapezoid contains ntr = h(2b - h + 1)/2 points, the rectangle
nr = h(b - h) points, and the triangle nt points. Since ntr = nr + nt
no extra storage is required to store a trapezoid as a packed triangle
and an appended rectangle. Each triangle and rectangle can be stored
either in row or column major order (TRANS = ’N’ or ’T’). Note that
the LDA of the rectangles (set of “squares”) will be either NB or n2.
There appear to be four packed triangles because we have four cases
(’L’,’N’), (’L’,’T’), (’U’,’N’), (’U’,’T’). However, the layouts for packed
(’L’,’N’) and (’U’,’T’) formats are identical as are the layouts for packed
(’U’,’N’) and (’L’,’T’) formats. In the former case we have traditional
lower packed format; in the latter traditional upper packed format. Now
turn again to Figure 9. There are three trapezoidal swaths. In the first
swath of size n2 = 2 there is an upper packed triangle of order 2 and a
rectangle consisting of two “squares” each of size 4 by 2 stored rowwise
(TRANS = ’T’). The remaining two trapezoidal swaths, two and three,
are each trapezoidal swaths of size NB = 4. The second trapezoid consists
of an upper packed triangle of order 4 and a rectangle consisting of a
single square of size four. The last trapezoid consists of an upper packed
triangle of order 4 and a rectangle consisting of no squares.
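
The point counts of the trapezoidal swaths can be checked with a short Python sketch. The function names are hypothetical; the sketch only evaluates the formulas nt, nr and ntr given above for each swath of a BHF array.

```python
def trapezoid_counts(b, h):
    """Point counts (nt, nr, ntr) for a trapezoid of base h and long side b."""
    nt = h * (h + 1) // 2              # packed triangle
    nr = h * (b - h)                   # appended rectangle
    ntr = h * (2 * b - h + 1) // 2     # whole trapezoid
    return nt, nr, ntr


def bhf_swaths(n, nb):
    """Counts for every swath: the first has base n2, the others base NB."""
    n1 = -(-n // nb)
    n2 = n + nb - n1 * nb
    swaths, b = [], n
    for h in [n2] + [nb] * (n1 - 1):
        swaths.append(trapezoid_counts(b, h))
        b -= h                         # the next swath is shorter by h rows
    return swaths


swaths = bhf_swaths(10, 4)
```

For the example of Figure 9 the three swaths hold 19, 26 and 10 points, which sum to n(n + 1)/2 = 55, the storage of a standard packed array.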

3.2.1 A Matching BLAS 3 Cholesky Algorithm. The
Linpack and LAPACK algorithms for DPPFA and DPPTRF are left looking
when uplo = ’U’. The LAPACK algorithm for DPPTRF is right looking
when uplo = ’L’. Linpack does not have a Cholesky algorithm when
uplo = ’L’. However, these algorithms are not suited for our BLAS 3
implementation of Cholesky using BHF. We only describe the uplo =
’L’ algorithm here. We choose to mimic the uplo = ’U’ algorithm of
LAPACK DPOTRF. This algorithm could be called hybrid since it has
both right and left looking characteristics. It is better to choose the
uplo = ’U’ algorithm because all DGEMM computations become ’T’,
’N’ instead of ’N’, ’T’; see [4] for details. However, there are stronger
reasons to choose the hybrid algorithm. First, we choose it because each
block triangle is updated, factored, and does all its scalings in the outer
J loop of our block hybrid Cholesky (BHC) algorithm, which we now
describe briefly. There are n1 = ⌈n/NB⌉ passes through the outer J loop.
To be able to only use BLAS 3 we need each blocked packed triangle
to be in FULL format. We use a buffer T of size NB^2 to copy a packed
triangle to full format in T. At the beginning of a pass through the J
loop we start a K loop that calls DSYRK and DGEMM to update T
and the rectangle (consisting of a set of squares - inner I loop) below T.
Next, T is factored by kernel routine DPOFU. After factoring T, DTRSM
is called to scale the rectangle (set of squares - I loop) beneath T. The
pass through the J loop ends by copying full T back to packed format.
Note here that the corresponding DPPFA/DPPTRF algorithm would copy
each T back and forth multiple times. The use of DSYRK is a kernel
routine call. For DGEMM and DTRSM there are two and one rectangles,
respectively, each consisting of a set of squares. Hence, a single call on
two and one rectangles requires no data copying within the calls.
Alternatively, one could call the DGEMM and DTRSM kernels several
times, once for each square in the set.

Another reason to choose the hybrid algorithm has
to do with how its matrix operands enter L1 cache. Consider the kernel
routine DPOFU and suppose the triangle T has order n. Let j be the
outer loop variable. During the j-th pass of the loop a rectangle of size
j(n + 1 - j) is accessed. A triangle above the rectangle of size j(j - 1)/2 is
no longer needed and another triangle to the right of the rectangle of size
(n - j)(n - j + 1)/2 has yet to be accessed. Note that these three figures
have exactly n(n + 1)/2 points. Now the rectangle has maximal area n^2/4
when j = n/2. However, using either the right or left looking algorithm
leads to a maximal area of size n^2/2.

Now let us briefly look at the
overhead of BHC. The major cost is the copying of n1 packed triangles
of size NB to and from full format. A very minor cost is the overhead
of calling the kernel routines. First, each kernel routine needs no error
checking as it is an internal routine. There are n1(n1 + 1)/2 submatrices
and n1(n1 + 1)(n1 + 2)/6 calls to kernel routines. These calls consist of
n1 calls to DPOFU, n1(n1 - 1)/2 calls each to DSYRK and DTRSM, and
n1(n1 - 1)(n1 - 2)/6 calls to DGEMM. Let us use an example to illustrate
the tiny overhead. Assume n = 1000 and NB = 100, which is reasonable
on IBM Power 3 machines. Now, n1 = 10 and so there are 220 kernel
calls. Let M = 1,000,000. Each DGEMM call consumes 2M Flops, each
DSYRK and DTRSM call consumes M Flops and each DPOFU consumes
approximately M/3 Flops. Now 10(M/3) + 90M + 240M = 1000(M/3)
which is the Flop count of Cholesky when n = 1000. Clearly, the calling
overhead is tiny and so the overhead cost of BHC is the copying cost
of packed triangles to full triangle and back (a total of 50,500 matrix
elements). Of course, we have not included the cost of Point 7 in the
Introduction. This cost consists of moving in place n(n + 1)/2 = 500,500
matrix elements.
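
The counts in this example can be reproduced with a few lines of Python. This is only an arithmetic check under the per-call Flop figures quoted above (which are themselves approximate); the function name bhc_kernel_calls is an assumption.

```python
from fractions import Fraction


def bhc_kernel_calls(n, nb):
    """Kernel call counts of the BHC algorithm for n1 = ceil(n/NB)."""
    n1 = -(-n // nb)
    return {
        "DPOFU": n1,
        "DSYRK": n1 * (n1 - 1) // 2,
        "DTRSM": n1 * (n1 - 1) // 2,
        "DGEMM": n1 * (n1 - 1) * (n1 - 2) // 6,
    }


n, nb = 1000, 100
calls = bhc_kernel_calls(n, nb)
M = nb ** 3                                    # 1,000,000 when NB = 100
flops = (Fraction(M, 3) * calls["DPOFU"]
         + M * (calls["DSYRK"] + calls["DTRSM"])
         + 2 * M * calls["DGEMM"])
# elements of the n1 packed triangles that are copied to full format
copied = calls["DPOFU"] * nb * (nb + 1) // 2
```

The four call counts are 10, 45, 45 and 120, which add up to the 220 kernel calls mentioned in the text, and the Flop total equals n^3/3.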

In this section, we discussed a fully portable replacement for Linpack
DPPFA and LAPACK DPPTRF. As stated in the previous Section and
the Introduction, the new algorithm is a direct translation of a vanilla
point algorithm where each scalar operation is replaced by a level 3 kernel
operation that runs at nearly peak performance. And, we only used
existing level 3 BLAS and kernel routine DPOFU. Actually, only the kernel
routine parts of the level 3 BLASes were required.

3.3. PROGRAMMING NOTES FOR SQUARE 
BLOCKED PACKED FORMATS 

Like ordinary packed formats the implementor of a packed format
library code explicitly handles his own addressing; e.g., AP(IJ) points
at a(i,j) in the packed array AP representing the symmetric/triangular
array A. We do the same thing for square blocked packed format and
BHF; e.g., see Figure 8 where AP is dimensioned a(*).

3.4. PERFORMANCE FOR CHOLESKY 

There are two plots. Both plot MFlops versus matrix order n. Note
that the x-axis is log scale; we let n range from 10 to 2000. In the
comparison of Square Blocked Packed Cholesky versus DPOTRF (Figure
10, left) we do not include the cost of transforming the data format.
This is perhaps unfair. Nonetheless, we did it to demonstrate what
type of performance is possible. Note that DPSTRF shows some choppy
behavior especially when n is small. The matrix orders where this occurs
are not multiples of four. For example, when n = 70 the performance
is about the same as n = 60. The explanation is that DPSTRF is solving an
order n = 72 problem and the MFlop computation is being done for
n = 70. However, the kernel routines are much simpler when there is no
fixup code. Note that DPSTRF is always faster than DPOTRF, by as much
as a factor of four when n = 60 and at least 15 % for n = 2000. In
the comparison of BHC versus LAPACK (see Figure 10, right) we give
four graphs: BHC, BHC + data transformation, DPOTRF, and DPPTRF;
we name these curves 1, 2, 3, and 4. For small n we do not use the data
transformation. In fact, we wrote a packed format Cholesky factorization
kernel for uplo = ’L’ when n is small. Note that the L1 cache on Power 3
holds 8192 double words and that this factor kernel peaks at 620 MFlops
for n > 160. In Section 3.2.1 we showed that n^2/4 is the effective cache
size and so, n ≈ 180. Note, the initial data transformation does cost
something. The actual crossover happened at n = 230. For n < 230
the curves 1 and 2 are identical. For n > 230, it pays to use the data
transformation and the curves 1 and 2 separate. A fair comparison would
be curve 2 versus curve 4. Curve 2 is, on the average, three times faster
than curve 4. Also, curve 2 is much faster than curve 3 for small n (up
to four times faster) and more than 10 % faster for large n.

4. VECTORS 

Vectors generalize as follows. Each vector is a collection of subvectors.
The subvectors have the same format as Fortran 77 vectors:
X(0:NB-1:INCX). The “stride” between two subvectors (either constant or
variable) has to be determined. This vital information becomes part of the
definition of the vector as a collection of subvectors.
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Figure 10 Performance for Cholesky 



5. RECURSION 

What we have now are block based data structures stored in the
conventional column major or row major order. However, recursion requires
a new way to store the blocks. To be able to do that we can address
the blocks through the use of integer tables. Since the blocks contain NB^2
elements the number of blocks will be relatively small. This means that
the additional storage for the tables will be tiny.
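A minimal sketch of addressing blocks through an integer table follows. It is an illustration only: the Z-order (Morton) block ordering, the column major layout inside each block, and the names build_block_table and element_offset are assumptions of this sketch, not details taken from the text.

```python
def morton(i, j, bits):
    """Interleave the bits of block row i and block column j (Z-order)."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z


def build_block_table(nblk, nb):
    """table[I][J] = offset in the flat array of the first element of block (I, J).

    The blocks are laid out in Z-order, one possible recursive ordering
    (an assumption of this sketch).
    """
    bits = max(1, (nblk - 1).bit_length())
    order = sorted(((I, J) for I in range(nblk) for J in range(nblk)),
                   key=lambda p: morton(p[0], p[1], bits))
    table = [[0] * nblk for _ in range(nblk)]
    for pos, (I, J) in enumerate(order):
        table[I][J] = pos * nb * nb
    return table


def element_offset(table, nb, i, j):
    """Flat offset of matrix element (i, j); each block is column major inside."""
    return table[i // nb][j // nb] + (j % nb) * nb + (i % nb)
```

For an order 8 matrix with NB = 2 the table has only 16 integer entries, which is the sense in which the extra storage is small compared with the 64 matrix elements.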

6. HOW THE BLAS CHANGE

The main thing to note is that data copying has been removed. A
new set of BLAS, namely factor kernels, must be defined. However, the
conventional BLAS become simpler to write as there is neither data
copying nor data allocation to be considered.

7. KERNEL ROUTINES 

A kernel routine for a level 3 BLAS or for a factorization routine is
that piece of code that performs the floating point operations. Vanilla
codes for these routines are simple scalar codes consisting of three nested
loops and are found in some text books. For example, the ’N’, ’N’ case
of vanilla DGEMM has statement T = T + A(i,k)*B(k,j) in the inner
k loop and initially T = C(i,j), etc. For the ’T’, ’N’ case the inner loop
statement is T = T + A(k,i)*B(k,j), etc. We would name these codes
DAB and DATB. Now the suffix 4 means the outer j, i loops are both
unrolled by four. Hence, 16 independent dot products instead of one are
being done by the high performance production versions of these vanilla
codes. Thus, DAB4 and DATB4 used in Figures 4 and 8 are the suffix 4
codes briefly described above. Currently, they are being written by hand.
However, in principle, the kernel routines can be automatically produced
by a compiler and/or pre-processor for IBM Power 3 processors. The
other kernel routines DLLNU4 in Figure 4 and DPOFU4, DSLVL4, DTATA4
in Figure 8 are done in a similar fashion. Only kernel routine RGETR3 is
more complex. It involves a combination of recursion and kernel routines
plus logic to handle the partial pivoting aspects. In [1] pages 569 and
570 further details are given. Note however, that now we use 4 by 4
unrolling instead of the 4 by 2 unrolling in [1].
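The vanilla ’N’, ’N’ kernel and its suffix 4 variant can be sketched as follows. This is a plain Python illustration of the loop structure only, not the production kernels; the function names dab and dab4 mirror the names DAB and DAB4 used above, and the sketch assumes the matrix orders are multiples of four so that no fixup code is needed.

```python
def dab(A, B, C):
    """Vanilla 'N','N' kernel: C = C + A*B, one dot product in the inner k loop."""
    m, p, n = len(A), len(B), len(B[0])
    for j in range(n):
        for i in range(m):
            t = C[i][j]
            for k in range(p):
                t = t + A[i][k] * B[k][j]
            C[i][j] = t


def dab4(A, B, C):
    """Same computation with the outer j and i loops both unrolled by four.

    Assumes the row and column counts of C are multiples of four (no fixup code).
    """
    m, p, n = len(A), len(B), len(B[0])
    assert m % 4 == 0 and n % 4 == 0
    for j in range(0, n, 4):
        for i in range(0, m, 4):
            # 16 independent accumulators, one per entry of the 4 by 4 tile of C
            t = [[C[i + r][j + c] for c in range(4)] for r in range(4)]
            for k in range(p):
                a = [A[i + r][k] for r in range(4)]   # four operands from A
                b = [B[k][j + c] for c in range(4)]   # four operands from B
                for r in range(4):
                    for c in range(4):
                        t[r][c] += a[r] * b[c]
            for r in range(4):
                for c in range(4):
                    C[i + r][j + c] = t[r][c]
```

Each pass through the k loop of dab4 loads eight operands and performs 16 multiply-adds, whereas dab loads two operands per multiply-add; this reuse of loaded operands is what the unrolling is meant to achieve.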

8. SUMMARY AND CONCLUSIONS 

I have described several novel data formats for dense linear algebra
and have described some novel simple algorithms that utilize these new
data structures. I have relied on a heuristic that is the key factor
governing performance on processors with deep memory hierarchies, namely
blocking or tiling. To be able to use this heuristic we have made use of
the following fact from linear algebra: “Some point algorithms have a
submatrix block formulation.” What is lacking are concrete demonstrations
that these data formats improve the performance of standard
linear algebra software typified by ESSL and LAPACK. This
paper shows that the result is true for Cholesky factorization and for
LU factorization with partial pivoting. Not presented are results on QR
factorization. However, in [3] we present results that indicate this
result will be true. Finally, there is agreement that the new software that
is being developed can become part of LAPACK and ESSL if sufficient
gains in performance and/or storage utilization are demonstrated. These
preliminary results indicate that this will happen.

DISCUSSION

Speaker: Fred Gustavson 

Richard Hanson : Does this compelling result carry over to architec- 
tures other than the Power3? 

Fred Gustavson : My collaborators Bo Kågström, Erik Elmroth, Isak
Jonsson, and Per Ling at the University of Umeå, Sweden, and Alexander
Karaivanov, Minka Marinova, Jerzy Wasniewski and Plamen Yalamov at
Uni-C in Lyngby, Denmark, have obtained similar results. In particular,
the Uni-C group worked on recursive packed Cholesky and produced
results on the Intel Pentium III, the Compaq Alpha EV6, the SGI R10000,
the Sun Ultra Sparc II, and the HP PA-8500.

Kristopher Buschelman : Whose responsibility is it for the construc- 
tion of these complicated data structures? That is, does the end user 
need to build a matrix into this format and then pass this data structure 
into your routines? 

Fred Gustavson : I think these new data structures are perhaps sim- 
pler. However, the end user does not need to construct them. Either 
a compiler or algorithm writer can support them. For example, the 
LU = PA algorithm described here accepts standard Fortran/C formats 
and converts this input into the new format. 

Robert van de Geijn : One real opportunity presented by the new 
hierarchical data structure lies in the possible link that can be made 
with sparse matrices arising from structured problems. A hierarchical 
matrix can be used to store the sparse matrix by pruning parts of the 
hierarchy corresponding to zero submatrices. 

Fred Gustavson : My approach works best for level 3 computations;
e.g., matrix factorization. I want to ensure that each matrix operand
that enters L1 cache gets used by the floating point unit(s) multiple
times so as to amortize the cost of bringing that operand into L1 cache.

Masaaki Shimasaki : Could you explain the accuracy results of your
numerical experiments?

Fred Gustavson : The new algorithms have the same accuracy as the 
conventional algorithms. 

David Walker : Have you compared your approach with a modified
LAPACK algorithm in which each of the blocks is contiguous in memory?

Fred Gustavson : In some sense my algorithm is a modified LAPACK
algorithm where the blocks are contiguous in memory. However, instead
of calling LAPACK TF2 routines, I call level 3 factor kernel routines.
And sometimes, I call kernel routines instead of Level 3 BLAS. And
finally, I have changed the LAPACK algorithms in some cases, e.g., I
have introduced recursive algorithms.

Vladimir Getov : Your approach is specific to architectures with
memory hierarchies. How do you see the suitability of this approach to
future architectures on a long-term basis?

Fred Gustavson : I think architectures in the future will still have
memory hierarchies and hence the approach will remain valid.

Abstract The Fast Fourier Transform (FFT) algorithm that calculates the
Discrete Fourier Transform (DFT) is one of the major breakthroughs in
scientific computing and is now an indispensable tool in a vast number
of fields. Unfortunately, software packages that provide fast computation
of the DFT via FFT differ vastly in functionality as well as uniformity.
A widely accepted Applications Programmer Interface (API) for DFT would
advance the field of scientific computing significantly. In this paper, we
formulate an API for DFT computation that encompasses all the
functionality offered by a number of popular packages combined,
allows easy porting from existing codes, and exhibits a systematic
naming convention with relatively short calling sequences.

Keywords: API, FFT, DFT, scientific computing. 

1. INTRODUCTION 

The Fast Fourier Transform (FFT) algorithm that calculates the
Discrete Fourier Transform (DFT) is one of the major breakthroughs in
scientific computing and is now an indispensable tool in a vast number
of fields. Unfortunately, software packages that provide fast computation
of the DFT via FFT differ vastly in functionality as well as uniformity.
A widely accepted Applications Programmer Interface (API) for DFT would
advance the field of scientific computing in several major aspects. First,
and most obvious, applications programmers could port their codes to
different platforms much more easily. Second, a standard API would
further encourage vendors to produce implementations that are highly
optimized. Third, a common API for fundamental computational tasks
such as the DFT will lead to advances in computational methods and
software that rely on the DFT as a building block.

Despite the tremendous benefit of a common API, the fact that it does not
exist indicates the number of obstacles preventing its formulation and
adoption. The diverse functionality of existing software packages
demands that a common API be somewhat inclusive in the capability that
it offers. The non-uniformity of existing software packages demands that
the common API be natural and amenable to easy porting. Finally,
the common API must exhibit a certain superiority over existing ones to
motivate the effort of converting legacy codes.

2. API OF POPULAR SOFTWARE 
PACKAGES 

We offer a brief survey of some popular packages: two from the public
domain, two from hardware vendors, and two from mathematical library
vendors. The software packages and libraries surveyed are, in
alphabetical order, Cray libsci, FFTPACK, FFTW, IBM ESSL, IMSL, and NAG.
We discuss our findings in relation to functionality and portability.

The first significant difference in functionality is that of the data
length N that is supported. ESSL only accepts N that are factorizable
into primes no larger than 11, subject to other minor conditions. While all
other packages have some routines that support general N, most resort
to some order N^2 algorithm when N is “inconvenient,” for example,
a large prime. Only Cray libsci on Cray’s MPP platform (such
as the T3E) and FFTW implement an order N log N algorithm for this
situation. The second difference in functionality is that of handling
strided data and multiple transforms via one single call. Only ESSL and
FFTW support this. The third major difference in functionality is that
of performing a transform without the use of auxiliary storage. Only NAG
supports this.
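As a small illustration of the length restriction mentioned above, the following sketch tests whether N factors entirely into primes no larger than 11. It covers only that one condition; the other minor conditions are not modelled, and the function name is hypothetical rather than part of any surveyed package.

```python
def only_small_prime_factors(n, largest=11):
    """Return True if n >= 1 has no prime factor larger than `largest` (at most 11 here)."""
    if n < 1:
        return False
    for p in (2, 3, 5, 7, 11):
        if p > largest:
            break
        while n % p == 0:
            n //= p
    # whatever is left over is a product of primes that were not divided out
    return n == 1
```

A caller could use such a test to decide in advance whether a given length needs padding or a different routine.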

By portability we mean the ease with which an application that uses one
library can switch to another library. The paramount problem is that of
data representation. The simple case of complex-valued input data
illustrates the problem. While most packages in this case expect the input
to be stored in an array of complex type, NAG requires that the real part
and imaginary part of the data be stored separately in two real arrays.
Porting from one library to another is difficult without an explicit copy
that wastes both execution time and memory. The crux of the data
representation problem is that in DFT computations, more than one
representation is possible that can exploit the special properties
present in the input or output domain (e.g. real-valued sequence,
conjugate-even sequence). Unfortunately, these representations are not
easily portable. Since each package adopts only one representation for
each data domain, major portability problems exist whenever these
representations differ. The second portability problem relates to auxiliary
storage areas. The packages differ in their need for explicit reference to
these areas, and in the sizes they require. The third problem is that
of placement of results. Some packages support by default an “in-place”
computation where the result overwrites the input, while others support
“out-of-place” computation and, in the case of Fortran, expect the user to
violate the non-aliasing requirement to obtain “in-place” functionality.

3. OBJECTIVES AND DESIGN OF 
PROPOSED API 

Our vision is that, for DFT computation, the scientific and engineer- 
ing community would eventually be able to rely on a common interface 
for which most platforms would provide highly optimized implementa- 
tions. Our objectives here are to provide a set of APIs that (1) supports 
a set of functionality that is a superset of the combined functionality 
of the packages surveyed and allows transition to the new API to be
straightforward; (2) exhibits simplicity in the calling sequences; and (3)
offers clarity in the naming convention and organization.

3.1. DEFAULT OPTIONS AND
INDEPENDENT UPDATE 

We accommodate the seemingly contradictory objectives (1) and (2)
by a different style of interface. We illustrate this style by the in-place
computation of a single-precision, complex, one-dimensional, forward
transform of length 32.

    Integer :: Dim = 1, Len = 32
    My_Desc => DFTP_Define_Desc(Dim, Len, &
                                DFTP_Param_Precision_Real)
    Call DFTP_DFT_Forward(My_Desc, X_inout)

Suppose we have sixteen transforms of length 32 and that the
data is stored contiguously; then, prior to the call to DFTP_DFT_Forward,
we issue

    Call DFTP_Change_NumTransforms(My_Desc, 16)

Briefly speaking, we organize a vast variety of functionality into
different options, each with a default. Instead of using multiple parameters
in the computation to set choices for each of the many options, we issue
a call to the relevant option changing routine only when the default
choice is inappropriate. (The example above suggests that the number
of transforms is an option whose default is 1.)
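The style of defaults plus independent updates can be mimicked in a few lines. The sketch below is a Python analogy only, not the proposed Fortran interface: the class Descriptor and the functions change_num_transforms and dft_forward are hypothetical stand-ins, and the transform is a direct order N^2 evaluation rather than an FFT.

```python
import cmath


class Descriptor:
    """Holds the transform definition with every option at its default."""

    def __init__(self, dim, periods, precision="real"):
        self.dim = dim
        self.periods = list(periods)
        self.precision = precision
        self.num_transforms = 1   # default: a single transform
        self.scale = 1.0          # default scale factor


def change_num_transforms(desc, new_number):
    """Option changing routine: called only when the default is inappropriate."""
    desc.num_transforms = new_number


def dft_forward(desc, x):
    """In-place forward DFT of desc.num_transforms contiguous 1-D sequences."""
    n = desc.periods[0]
    for m in range(desc.num_transforms):
        seg = x[m * n:(m + 1) * n]   # copy of the m-th sequence
        for k in range(n):
            x[m * n + k] = desc.scale * sum(
                seg[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
```

The computation routine takes only the descriptor and the data; every other choice travels inside the descriptor, so the calling sequence stays short however many options exist.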
3.2. EXPLICIT RECOGNITION OF DOMAIN
AND REPRESENTATION

As alluded to previously, data representation is a major API issue. We
identify three independent components of data representation. When
each component is recognized explicitly, we can accommodate
systematically the different representations adopted by all the packages.
We first review the mathematical background before elaborating on the
representations.

The proposed API focuses on one mathematical transform only, the
Discrete Fourier Transform (DFT). The natural domain for a
one-dimensional DFT is the space of complex periodic sequences:

$$\{w_j\}, \quad w_j \in \mathbb{C}, \quad j = 0, \pm 1, \pm 2, \ldots,$$

such that there is a positive integer $n$ with

$$w_j = w_{n+j} \quad \text{for all } j.$$

For a general $d$-dimensional DFT, the natural domain is the
$d$-dimensional periodic sequences

$$\{w_{j_1,j_2,\ldots,j_d}\}, \quad w_{j_1,j_2,\ldots,j_d} \in \mathbb{C}, \quad j_1, \ldots, j_d = 0, \pm 1, \pm 2, \ldots,$$

such that there are positive integers $n_1, n_2, \ldots, n_d$ with

$$w_{j_1,j_2,\ldots,j_d} = w_{n_1+j_1,\,n_2+j_2,\,\ldots,\,n_d+j_d}.$$

Our API considers the following (scaled) transform:

$$z_{k_1,k_2,\ldots,k_d} = \sigma \sum_{j_d=0}^{n_d-1} \cdots \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} w_{j_1,j_2,\ldots,j_d} \exp\left( \delta\, 2\pi \sqrt{-1} \sum_{l=1}^{d} j_l k_l / n_l \right),$$

for $k_l = 0, \pm 1, \pm 2, \ldots$, where $\sigma$ is an arbitrary
real-valued scale factor and $\delta$ is either $+1$ or $-1$. There is a
well-known disagreement on whether to define the forward transform to be
the case $\delta = -1$ or the case $\delta = +1$. We adopt the former and
define the case $\delta = +1$ as the reverse transform.
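The definition can be evaluated directly, which is useful as a reference when checking the sign and scale conventions. The sketch below is a straightforward order N^2 evaluation of the formula, not an FFT; the function name dft and the dictionary-based storage of the sequence are choices made for this illustration.

```python
import cmath
import itertools


def dft(w, n, sigma=1.0, delta=-1):
    """Direct evaluation of the scaled d-dimensional DFT over one period.

    w     : dict mapping index tuples (j_1, ..., j_d) to complex values
    n     : tuple of periods (n_1, ..., n_d)
    sigma : real scale factor
    delta : -1 for the forward transform, +1 for the reverse transform
    """
    idx = list(itertools.product(*(range(nl) for nl in n)))
    z = {}
    for k in idx:
        acc = 0j
        for j in idx:
            # sum over l of j_l * k_l / n_l
            phase = sum(jl * kl / nl for jl, kl, nl in zip(j, k, n))
            acc += w[j] * cmath.exp(delta * 2j * cmath.pi * phase)
        z[k] = sigma * acc
    return z
```

With this convention, a forward transform with scale 1 followed by a reverse transform with scale 1/(n_1 n_2 ... n_d) returns the original data.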

It can be easily seen that the result of the transform is also complex- 
valued periodic with the same period as the input data. There are many 
common situations in which the input data are more special than just 
being complex periodic, such as real periodic for example. Special input 
domains in general lead to special output domains as well. Moreover, 
different representations are possible for each domain. These
complications are a major reason why existing software packages have quite
incompatible interfaces. We address the issues of different domains and
representations next.

Table 1 Domain Correspondence of Forward and Reverse DFT

    Input             Output
    ----------------  ----------------
    General           General
    Real              Conjugate Even
    Conjugate Even    Real

3.2.1 Domains. We consider three domains explicitly:

1 General Sequences: This is the largest domain, namely the domain
of complex-valued periodic sequences:

$$w_{j_1,j_2,\ldots,j_d} \in \mathbb{C}, \quad \text{and} \quad w_{j+n} = w_j,$$

where $w_{j+n}$ is shorthand for $w_{j_1+n_1,\ldots,j_d+n_d}$.

2 Real Sequences: This is the domain of real-valued periodic
sequences:

$$w_{j_1,j_2,\ldots,j_d} \in \mathbb{R}, \quad w_{j+n} = w_j.$$

3 Conjugate-Even Sequences: An even sequence (analogous to an
even function) is one that is symmetric about the origin: $w_{-j} = w_j$. A
conjugate-even sequence is one that is even up to conjugacy:

$$\{w_{j_1,j_2,\ldots,j_d}\}, \quad w_{j_1,j_2,\ldots,j_d} \in \mathbb{C}, \quad w_{j+n} = w_j,$$

and

$$w_{-j_1,-j_2,\ldots,-j_d} = \overline{w_{j_1,j_2,\ldots,j_d}}.$$

We do not consider more restrictive subdomains such as real-valued
conjugate-even sequences, for example. Either the forward or the reverse
transform applied to data in one of the above domains yields a result
in another of these domains (see for example [1, 2]), as tabulated
in Table 1.
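The correspondence in Table 1 is easy to confirm numerically in one dimension. The following sketch uses a direct DFT and two small predicates; the helper names dft1, is_real and is_conjugate_even are introduced here for illustration only.

```python
import cmath


def dft1(w, delta=-1):
    """Unscaled one-dimensional DFT of one period (delta = -1 forward, +1 reverse)."""
    n = len(w)
    return [sum(w[j] * cmath.exp(delta * 2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]


def is_real(z, tol=1e-9):
    """True if every entry has a negligible imaginary part."""
    return all(abs(complex(v).imag) <= tol for v in z)


def is_conjugate_even(z, tol=1e-9):
    """True if z_{-k} equals the conjugate of z_k, indices taken modulo the period."""
    n = len(z)
    return all(abs(z[(-k) % n] - complex(z[k]).conjugate()) <= tol
               for k in range(n))
```

Applying dft1 to a real sequence gives a conjugate-even result, and applying either direction of the transform to that result gives a real sequence again, in agreement with the table.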

We now consider the representation of each of the three domains.
There are in fact two components of representation. First, the domains
of periodic (infinite length) sequences are mapped bijectively to a space of
finite-length sequences. Second, these finite-length sequences are packed
into finite-length sequences of a built-in (native) data type. For example,
a one-dimensional real-valued period-$n$ sequence $w_j$, $j = 0, \pm 1, \pm 2, \ldots$,
is naturally mapped into a length-$n$ sequence $x_0, x_1, \ldots, x_{n-1}$ where
$x_j = w_j$ for $0 \le j < n$. This finite-length sequence $x$ in some
situations has to be packed into an array X of type complex where

$$\operatorname{Re}(X(j)) = x_{2j} \quad \text{and} \quad \operatorname{Im}(X(j)) = x_{2j+1}.$$

The distinction between the two components of representation may seem
redundant, but is in fact necessary if we are to accommodate lucidly the
varieties of current practices such as in-place routines (that is, output
overwriting input) where source and result domains are different (for
example, the so-called real-to-complex transforms). Moreover,
explicitly recognizing these two components leaves flexibility for additional
representations to be supported in the future.

3.2.2 Representation: Mapping to Finite Sequences.

For each domain, we support mappings to finite real-valued or
complex-valued (or both) sequences. The sequences below are denoted by

$$\{w_{j_1,j_2,\ldots,j_d}\}$$

with $d$ positive periods

$$n_1, n_2, \ldots, n_d > 0.$$

1 General Sequences. A general sequence is mapped naturally
to a complex-valued finite sequence $x_{j_1,j_2,\ldots,j_d}$,
$0 \le j_i < n_i$ for $1 \le i \le d$, where

$$x_{j_1,j_2,\ldots,j_d} = w_{j_1,j_2,\ldots,j_d}.$$

We also support a mapping of $w$ to two real-valued finite sequences
$x_{j_1,j_2,\ldots,j_d}$, $y_{j_1,j_2,\ldots,j_d}$, $0 \le j_i < n_i$ for
$1 \le i \le d$, where

$$x_{j_1,j_2,\ldots,j_d} = \operatorname{Re}(w_{j_1,j_2,\ldots,j_d}), \quad y_{j_1,j_2,\ldots,j_d} = \operatorname{Im}(w_{j_1,j_2,\ldots,j_d}).$$

2 Real Sequences. A real sequence is mapped naturally to a
real-valued finite sequence $x_{j_1,j_2,\ldots,j_d}$, $0 \le j_i < n_i$ for
$1 \le i \le d$, where

$$x_{j_1,j_2,\ldots,j_d} = w_{j_1,j_2,\ldots,j_d}.$$

This is the only mapping we support for this domain.

3 Conjugate-Even Sequences. A conjugate-even sequence can
be mapped into a complex-valued sequence $x_{j_1,j_2,\ldots,j_d}$,
$0 \le j_1 \le \lfloor n_1/2 \rfloor$, and $0 \le j_i < n_i$ for
$2 \le i \le d$, where

$$x_{j_1,j_2,\ldots,j_d} = w_{j_1,j_2,\ldots,j_d}.$$

This mapping is bijective because the complete periodic sequence
can be reconstructed easily. Consider indices $j_i$, $1 \le i \le d$, where
$\lfloor n_1/2 \rfloor < j_1 < n_1$ and $0 \le j_i < n_i$ otherwise. We have

$$w_{j_1,j_2,\ldots,j_d} = \overline{w_{n_1-j_1,\,k_2 n_2-j_2,\,\ldots,\,k_d n_d-j_d}}$$

by periodicity and conjugate evenness. It is easy to see that

$$0 < n_1 - j_1 \le \lfloor n_1/2 \rfloor, \quad \text{and} \quad 0 \le k_i n_i - j_i < n_i, \quad 2 \le i \le d,$$

by choosing $k_i$ to be 1 when $j_i > 0$ and $k_i$ to be 0 otherwise. Thus,

$$w_{j_1,j_2,\ldots,j_d} = \overline{x_{n_1-j_1,\,k_2 n_2-j_2,\,\ldots,\,k_d n_d-j_d}}.$$

When the dimension of the sequence is one, we support a mapping
of $w$ into a real-valued sequence $x_{j_1}$, $0 \le j_1 < n_1$, where

$$x_{j_1} = \operatorname{Re}(w_{j_1}), \quad j_1 = 0, 1, \ldots, \lfloor n_1/2 \rfloor,$$

with the remaining entries of $x$ holding

$$\operatorname{Im}(w_{j_1}), \quad j_1 = 1, 2, \ldots, \lfloor (n_1 - 1)/2 \rfloor.$$

This mapping is also bijective. $w_0$ is always real-valued for a
conjugate-even sequence. When $n_1$ is odd,
$\lfloor (n_1 - 1)/2 \rfloor = \lfloor n_1/2 \rfloor$.
Thus, the $x$ vector gives us $w_{j_1}$ for $0 \le j_1 \le \lfloor n_1/2 \rfloor$.
When $n_1$ is even, simple symmetry shows that $w_{n_1/2}$ is real-valued
as well. Thus, once again, the $x$ vector contains $w_{j_1}$ for
$0 \le j_1 \le \lfloor n_1/2 \rfloor$. Consequently, the $x$ sequence always
yields $w_{j_1}$, $0 \le j_1 \le \lfloor n_1/2 \rfloor$,
from which the whole $w$ sequence can be reconstructed by virtue
of periodicity and conjugate evenness as discussed previously.
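The one-dimensional real-valued mapping of a conjugate-even sequence, and the reconstruction argument above, can be sketched as follows. One point is an assumption of this sketch and should not be read as the layout fixed by the API: the imaginary part of $w_j$ is stored at index $n - j$, which is one plausible placement that fills the vector exactly. The function names are likewise hypothetical.

```python
def conj_even_to_real(w):
    """Map one period w_0..w_{n-1} of a conjugate-even sequence to n reals."""
    n = len(w)
    x = [0.0] * n
    for j in range(n // 2 + 1):
        x[j] = w[j].real               # real parts for j = 0..floor(n/2)
    for j in range(1, (n - 1) // 2 + 1):
        x[n - j] = w[j].imag           # assumed placement of the imaginary parts
    return x


def real_to_conj_even(x):
    """Rebuild the full period from the n reals (inverse of the mapping above)."""
    n = len(x)
    w = [0j] * n
    for j in range(n // 2 + 1):
        # w_0 (and w_{n/2} for even n) is real, so no imaginary part is stored
        im = x[n - j] if 1 <= j <= (n - 1) // 2 else 0.0
        w[j] = complex(x[j], im)
    for j in range(n // 2 + 1, n):
        w[j] = w[n - j].conjugate()    # conjugate evenness plus periodicity
    return w
```

The round trip recovers the sequence for both odd and even periods, which is the bijectivity claimed in the text.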

3.2.3 Representation: Packing into Data Types. We
consider packing $\{x_{j_1,j_2,\ldots,j_d}\}$, $0 \le j_i < J_i$,
$1 \le i \le d$, a real-valued or complex-valued multi-dimensional
sequence, into a real-valued or complex-valued multi-dimensional sequence
$\{X_{j_1,j_2,\ldots,j_d}\}$ of data items in real or complex data type.
Note that the $J_i$'s are not necessarily the periods
of the sequence in question. The packing is a straightforward one-to-one
correspondence in the case of real packed into real or complex into
complex:

$$X_{j_1,j_2,\ldots,j_d} = x_{j_1,j_2,\ldots,j_d}.$$

The two other cases are as follows.

1 Real Packed into Complex. A real-valued sequence

$$x_{j_1,j_2,\ldots,j_d}, \quad 0 \le j_i < J_i,$$

is packed into data items of complex data type $X_{j_1,j_2,\ldots,j_d}$,
$0 \le j_1 < \lceil J_1/2 \rceil$, and $0 \le j_i < J_i$, $2 \le i \le d$, by

$$\operatorname{Re}(X_{j_1,j_2,\ldots,j_d}) = x_{2j_1,j_2,\ldots,j_d}, \quad \operatorname{Im}(X_{j_1,j_2,\ldots,j_d}) = x_{2j_1+1,j_2,\ldots,j_d}.$$

When $J_1$ is odd, we further require that
$\operatorname{Im}(X_{\lfloor J_1/2 \rfloor,j_2,j_3,\ldots,j_d}) = 0$
for $0 \le j_i < J_i$, $2 \le i \le d$.

2 Complex Packed into Real. A complex-valued sequence

$$x_{j_1,j_2,\ldots,j_d}, \quad 0 \le j_i < J_i,$$

is packed into data items of real data type $X_{j_1,j_2,\ldots,j_d}$,
$0 \le j_1 < 2J_1$, and $0 \le j_i < J_i$, $2 \le i \le d$, by

$$X_{2j_1,j_2,\ldots,j_d} = \operatorname{Re}(x_{j_1,j_2,\ldots,j_d}), \quad X_{2j_1+1,j_2,\ldots,j_d} = \operatorname{Im}(x_{j_1,j_2,\ldots,j_d}).$$

Table 2 Representation and Defaults for All Domains

    Domain            Representation
    ----------------  ------------------------------------------
    General           Complex into Complex (Default)
                      Complex into Real
                      Real into Complex
                      Real into Real
    Real              Real into Real (Default)
                      Real into Complex
    Conjugate Even    Complex into Complex
                      Complex into Real (Default)
                      Real into Complex (for one-dimension only)
                      Real into Real (for one-dimension only)
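The two non-trivial packings can be sketched for the one-dimensional case, where only the first index is involved. The function names below are hypothetical and the lists stand in for arrays of the native real and complex types.

```python
def pack_real_into_complex(x):
    """Re(X[j]) = x[2j], Im(X[j]) = x[2j+1]; for odd length the last Im is 0."""
    J = len(x)
    X = []
    for j in range((J + 1) // 2):          # ceil(J/2) complex items
        im = x[2 * j + 1] if 2 * j + 1 < J else 0.0
        X.append(complex(x[2 * j], im))
    return X


def pack_complex_into_real(x):
    """X[2j] = Re(x[j]), X[2j+1] = Im(x[j]); 2J real items for J complex values."""
    X = []
    for v in x:
        X.extend((v.real, v.imag))
    return X
```

Packing a complex sequence into reals and then packing those reals back into complex items returns the original values, which shows that the two packings describe the same memory layout viewed through different data types.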

3.2.4 Options in Representation. The two mapping choices
coupled with the two packing choices result in a maximum of four
possible representations for some domains. These four representations are
complex into complex (that is, mapped to complex, and then packed
into complex), complex into real, real into complex, and real into real.
For each domain, we pick as default one representation that is either
most natural or most commonly adopted. Table 2 tabulates the options
offered and the default.

3.2.5 Stride: Data Structure. After a periodic sequence
of a certain domain is represented by a finite sequence of data items of a
specific data type, these data items reside in an array. All arrays in our
interface are considered flat. Generically, we consider the distribution of
a finite sequence

$$X_{j_1,j_2,\ldots,j_d}, \quad 0 \le j_i < J_i, \quad 1 \le i \le d,$$

to a one-dimensional (flat) array X of the same type. In general, we
will be dealing with multiple transforms. That is, there are $M$ such
sequences

$$X^{(m)}_{j_1,j_2,\ldots,j_d}, \quad 0 \le j_i < J_i, \quad 1 \le i \le d, \quad 0 \le m < M.$$

The distribution we adopt for these multiple sequences is that of uniform
stride, which is specified completely by the distances between neighboring
elements within each dimension and between the same elements of
neighboring sequences. Specifically,

$$X(\operatorname{index}(I_0; L_1, L_2, \ldots, L_{d+1}; j_1, j_2, \ldots, j_d, m)) = X^{(m)}_{j_1,j_2,\ldots,j_d},$$

where

$$\operatorname{index}(I_0; L_1, L_2, \ldots, L_{d+1}; j_1, j_2, \ldots, j_d, m) = I_0 + \sum_{i=1}^{d} j_i L_i + m L_{d+1}.$$

The value $I_0$ is the index of the first element of X (which defaults to 1
in Fortran for an array defined without an explicit beginning index) and we
call the parameters $L_i$ generic strides.

In the case of a single one-dimensional sequence, $L_1$ is simply the
stride. The default setting of strides in the general multi-dimensional
situation corresponds to the case where the sequences are distributed
tightly into the array:

$$L_1 = 1, \quad L_2 = J_1, \quad L_3 = J_1 J_2, \quad \ldots, \quad L_d = \prod_{i=1}^{d-1} J_i, \quad L_{d+1} = \prod_{i=1}^{d} J_i.$$

When the user needs to change this default, d+1 values must be supplied
as an integer vector to the routine that changes the stride options. In
the case of a single transform, the value $L_{d+1}$ can be arbitrary, but
must nevertheless be a valid integer.

This explicit specification of strides offers much flexibility in the
representation of transposes and provides a straightforward correspondence
to C/C++. Moreover, despite many advancements in Fortran compilers,
explicit striding can still offer potential performance gain over the use of
advanced language features in references to less common situations such
as sections of arrays.
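The index function and the default (tight) strides translate directly into code. In the sketch below the function names default_strides and index are chosen for illustration, and the strides are held in a zero-based Python list, so L[0] plays the role of $L_1$.

```python
def default_strides(J):
    """Tight packing: L_1 = 1, L_2 = J_1, ..., L_{d+1} = J_1 * ... * J_d."""
    L = [1]
    for Ji in J:
        L.append(L[-1] * Ji)
    return L


def index(I0, L, j, m):
    """I0 + sum_i j_i * L_i + m * L_{d+1}; j has d entries and L has d+1."""
    return I0 + sum(ji * Li for ji, Li in zip(j, L)) + m * L[len(j)]
```

For two 4 by 3 sequences and $I_0 = 1$ the default strides are 1, 4 and 12, and the 24 data items land on array positions 1 through 24 with no gaps. Swapping the first two strides to 3 and 1 addresses the same storage in transposed order, which is the flexibility mentioned above.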

3.3. ILLUSTRATIONS 

As an illustration, the case of representing complex input data (general
domain) by two real vectors containing the real and imaginary
parts is the “Real into Real” representation denoted by the parameter
DFTP_Param_Rep_RR. A code fragment to effect this (single precision)
in-place transform is

    Integer :: Dim = 1, Len = 32
    My_Desc => DFTP_Define_Desc(Dim, Len, &
                                DFTP_Param_Precision_Real)
    Call DFTP_Change_Rep_General(My_Desc, DFTP_Param_Rep_RR)
    Call DFTP_DFT_Forward(My_Desc, X_Re, X_Im)

As another illustration, consider the case of representing real input
data (real domain) in the default “Real into Real” and conjugate-even
data in the default “Complex into Real” representations. These default
representations are natural for the default in-place transforms. For an
input of N elements stored contiguously, the input array must have length
N + 2 to accommodate the output data, since the floor(N/2) + 1 complex
output values occupy up to N + 2 reals. This situation is common. A
code fragment to effect this in-place transform is

    Integer :: Dim = 1, Len = 32
    Real :: X_inout(Len+2)
    My_Desc => DFTP_Define_Desc(Dim, Len, &
                                DFTP_Param_Precision_Real)
    Call DFTP_Change_Domain(My_Desc, DFTP_Param_Domain_Real)
    Call DFTP_DFT_Forward(My_Desc, X_inout)

3.4. OPTIONS SUPPORTED 

We now list the various options supported. Different subsets of the
following list of options are supported by the various software packages
surveyed here.

■ A scale factor for the transform. Default is 1.

■ Multiple number of transforms. Default is 1.

■ Placement of result. Default is in place.

■ Domain of input (general, real, or conjugate even). Output domain
is implied. Default is general.

■ Representations for domain. Default defined for each domain.

■ Workspace requirement. Default is to make use of workspace.

In addition, we also propose two “advanced” options not commonly
found.

■ Allowance for reordering. This is an option that allows for the
order of input or output to be scrambled (e.g. by a bit-reversal
permutation). Default is no scrambling allowed.

It is well known that a number of FFT algorithms apply an
explicit permutation stage that is time consuming [3]. Doing away
with this step is tantamount to applying the DFT to input whose
order is scrambled, or to scrambling the order of the DFT
result. In applications such as convolution and power spectrum
calculation, the order of result or data is unimportant and thus
permission of scrambled order is attractive if it leads to higher
performance. Our API allows the following four options:

1 Input ordered, output ordered. This is the default.

2 Input ordered, output scrambled.

3 Input scrambled, output ordered.

4 Input scrambled, output scrambled. (Computation equivalent
to that of case 1.)

All scramblings involved are digit reversals. Precisely, a length $J$
is factored into $K$ ordered factors $D_1, D_2, \ldots, D_K$. Any index $i$,
$0 \le i < J$, can be expressed uniquely as $K$ digits
$i_1, i_2, \ldots, i_K$ where $0 \le i_l < D_l$ and

$$i = i_1 + i_2 D_1 + i_3 D_1 D_2 + \cdots + i_K D_1 D_2 \cdots D_{K-1}.$$

A digit reversal permutation scram(i) is given by

$$\operatorname{scram}(i) = i_K + i_{K-1} D_K + i_{K-2} D_K D_{K-1} + \cdots + i_1 D_K D_{K-1} \cdots D_2.$$

Factoring $J$ into one factor $J$ leads to no scrambling at all, that is,
scram(i) = i. Note that the factoring need not correspond exactly
to the number of “butterfly” stages to be carried out. In fact, the
computation routine in its initialization stage will decide if indeed
a scrambled order in some or all of the dimensions would lead to
performance gain. The digits of the digit reversal are recorded and
stored in the descriptor. These digits can be obtained by calling a
corresponding inquiry routine that returns a pointer to an integer
array. The first element is $K_1$, which is the number of digits for
the first dimension, followed by $K_1$ values of the corresponding
digits. If the dimension is higher than one, the next integer value is
$K_2$, etc.

We comment that a simple permutation such as mod-p sort [3] is a
special case of digit reversal. Hence this option could be useful to
high-performance implementation of one-dimensional DFT via a
“six-step” or “four-step” framework [3].

■ Allowance for transposition. This is an option that allows for the
result of a high-dimensional transform to be presented in a
transposed manner. Default is no transposition allowed.

Similar to the case of scrambled order, sometimes in higher dimension
transforms, performance can be gained if the result is delivered in
a transposed manner. Our API offers an option that allows the
output to be returned in a transposed form if performance gain is
expected. Since the generic stride specification is naturally suited
for representation of transposition, this option allows the strides
for the output to be possibly different from those originally
specified by the user. Consider an example where a two-dimensional
result $Y_{j_1,j_2}$, $0 \le j_i < n_i$, is expected. Originally the user
specified that the result be distributed in the (flat) array Y with generic
strides $L_1 = 1$ and $L_2 = n_1$. With the option that allows for
transposition, the computation may actually return the result into
Y with strides $L_1 = n_2$ and $L_2 = 1$. These strides are obtainable
from an appropriate inquiry function. Note also that in dimension
3 and above, transposition means an arbitrary permutation of the
dimensions.
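The digit reversal permutation defined under the reordering option can be written down directly from its two formulas. The sketch below is for illustration; the helper names digits and scram follow the notation of the text, and the factor list D holds $D_1, \ldots, D_K$ in order.

```python
def digits(i, D):
    """Digits i_1..i_K of i, where i = i_1 + i_2*D_1 + i_3*D_1*D_2 + ..."""
    out = []
    for Dk in D:
        out.append(i % Dk)
        i //= Dk
    return out


def scram(i, D):
    """scram(i) = i_K + i_{K-1}*D_K + i_{K-2}*D_K*D_{K-1} + ... + i_1*D_K*...*D_2."""
    result, weight = 0, 1
    for ik, Dk in zip(reversed(digits(i, D)), reversed(D)):
        result += ik * weight
        weight *= Dk
    return result
```

With the factors 2, 2, 2 this reduces to the familiar bit reversal on eight points, and with the single factor 8 it is the identity, matching the remark that factoring the length into one factor leads to no scrambling.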

3.5. NAMING ORGANIZATION AND 
CONVENTIONS 

The design of the API leads naturally to four categories of routines:

1 Definition routine. This defines a transform by its dimension,
periods, and precision required. There is only one routine, and the
result is a data structure called descriptor that holds all the
information and default settings.

2 Option changing routines. Since all options have a default value,
setting a different choice amounts to a change. Each of the options
listed previously has associated with it an option changing routine
whose effect is to update the descriptor.

3 Computation routines. Since all the settings made are recorded in
the descriptor, there is no need for different routine names for
computation purposes. The only computation routines are for forward
transform and reverse transform.

4 Inquiry routines. These routines allow retrieval of specific option
settings as well as error conditions and side effects such as the actual
scrambling performed should it be allowed.

The naming convention is DFTP_category_otherinfo. For example, the
definition routine is DFTP_Define_Desc, that is, defining a descriptor.
All the option changing and inquiry routines have the form

    DFTP_Change_options and DFTP_Inquire_options

for options such as Rep_Real and Stride_Input.

4. API PROPOSAL 

The set of routines, derived types, and parameters is packaged in a
Fortran 90 module named DFTPACK. Users must issue a “use” statement
USE DFTPACK to make the routines and associated objects visible.
The module includes the following:

1 Derived Type: The module defines one derived type DFTP_Desc 
which is the type for the DFT descriptor. 

2 Parameters: The module defines a number of parameters that are 
used to specify settings of options. 

3 Routines: The module provides four kinds of routines: definition 
routine, setting changing routines, computation routines, and in- 
quiry routines. 

4.1. PARAMETERS 

The list of parameters is as follows:

1 Precision specification parameters:

■ DFTP_Param_Precision_Real

■ DFTP_Param_Precision_Double

2 Domain specification parameters:

■ DFTP_Param_Domain_General

■ DFTP_Param_Domain_Real

■ DFTP_Param_Domain_ConEven

3 Domain representation parameters:

■ DFTP_Param_Rep_CC (Mapped to complex, packed into complex.)

■ DFTP_Param_Rep_CR (Mapped to complex, packed into real.)

■ DFTP_Param_Rep_RC (Mapped to real, packed into complex.)

■ DFTP_Param_Rep_RR (Mapped to real, packed into real.)

4 Workspace specification parameters:

■ DFTP_Param_Workspace_Allow (Use when deemed appropriate.)

■ DFTP_Param_Workspace_Avoid (Avoid if possible.)

5 Ordering specification parameters:

■ DFTP_Param_Ordering_OO (Ordered input, ordered output.)

■ DFTP_Param_Ordering_OS (Ordered input, scrambled output.)

■ DFTP_Param_Ordering_SO (Scrambled input, ordered output.)

■ DFTP_Param_Ordering_SS (Scrambled input, scrambled output.)

6 Transposition specification parameters:

■ DFTP_Param_Transposition_Allow (Transpose result when
appropriate.)

■ DFTP_Param_Transposition_Avoid (Avoid if possible.)

4.2. DEFINITION ROUTINE 

FUNCTION DFTP_Define_Desc(Dimension, Periods, Precision) &
   RESULT( My_Desc )
TYPE (DFTP_Desc), POINTER :: My_Desc
INTEGER, INTENT(IN) :: Dimension, Precision
INTEGER, DIMENSION(*), INTENT(IN) :: Periods

Precision is specified by one of the precision specification parameters:

■ DFTP_Param_Precision_Real, or

■ DFTP_Param_Precision_Double

4.3. SETTING CHANGING ROUTINES 

1 Scale Change 

The default scale is one. The routine name is overloaded.

SUBROUTINE DFTP_Change_Scale(My_Desc, New_Scale)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
REAL, INTENT(IN) :: New_Scale

SUBROUTINE DFTP_Change_Scale(My_Desc, New_Scale)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
DOUBLE PRECISION, INTENT(IN) :: New_Scale

2 Number of Transforms 

The default number of transforms is one. 

SUBROUTINE DFTP_Change_NumTransforms(My_Desc, New_Number)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN) :: New_Number

3 Domain Change 

The default domain is general sequence. 

SUBROUTINE DFTP_Change_Domain(My_Desc, New_Domain)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN) :: New_Domain

New_Domain is one of the domain specification parameters:

■ DFTP_Param_Domain_General, or

■ DFTP_Param_Domain_Real, or

■ DFTP_Param_Domain_ConEven.

4 Representation Change for a Specific Domain 

SUBROUTINE DFTP_Change_Rep_General(My_Desc, New_Rep)
SUBROUTINE DFTP_Change_Rep_Real(My_Desc, New_Rep)
SUBROUTINE DFTP_Change_Rep_ConEven(My_Desc, New_Rep)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN) :: New_Rep

New_Rep is one of the domain representation parameters:

■ DFTP_Param_Rep_CC, or

■ DFTP_Param_Rep_CR, or

■ DFTP_Param_Rep_RC, or

■ DFTP_Param_Rep_RR.

5 Stride Change 

The default stride is that of tight distribution. 

SUBROUTINE DFTP_Change_Stride_Input(My_Desc, New_Stride)
SUBROUTINE DFTP_Change_Stride_Output(My_Desc, New_Stride)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN), DIMENSION(*) :: New_Stride

New_Stride is a vector specifying the generic strides (Section 3.2.5).

6 Workspace Change 

The default is to allow the use of workspace. 

SUBROUTINE DFTP_Change_Workspace(My_Desc, New_Workspace)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN) :: New_Workspace

New_Workspace is one of the workspace specification parameters:

■ DFTP_Param_Workspace_Allow, or

■ DFTP_Param_Workspace_Avoid.

7 Ordering Change 

The default is that of ordered input and ordered output. 

SUBROUTINE DFTP_Change_Ordering(My_Desc, New_Ordering)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN) :: New_Ordering

New_Ordering is one of the ordering specification parameters:

■ DFTP_Param_Ordering_OO, or

■ DFTP_Param_Ordering_OS, or

■ DFTP_Param_Ordering_SO, or

■ DFTP_Param_Ordering_SS.

8 Transposition Change 

The default is to avoid transposing the output.

SUBROUTINE DFTP_Change_Transposition(My_Desc, New_Transposition)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
INTEGER, INTENT(IN) :: New_Transposition

New_Transposition is one of the transposition specification
parameters:

■ DFTP_Param_Transposition_Allow, or

■ DFTP_Param_Transposition_Avoid.

Table 3 Attributes of two-argument computational routines

X                                       Y
Type                          Intent    Type                          Intent
Complex (double complex)      In        Complex (double complex)      Out
Complex (double complex)      In        Real (double precision)       Out
Real (double precision)       In        Complex (double complex)      Out
Real (double precision)       In        Real (double precision)       Out
Real (double precision)       Inout     Real (double precision)       Inout

4.4. COMPUTATION ROUTINES 

The routine names are heavily overloaded. The number of parameters
varies, as do their type and precision, because of the different domains
and representations and the difference between the in-place and
out-of-place interfaces. The first parameter is always the pointer to the
descriptor, whose intent is “inout.” The data and result parameters are
all declared as assumed-size rank-1 arrays, DIMENSION(0:*). The first
form of the computation routine has two parameters following the
descriptor.

SUBROUTINE DFTP_DFT_{Forward,Reverse}(My_Desc, X, Y)

The parameters can have the attributes listed in Table 3.

The second form of the computation routines has four parameters
following the descriptor. This form applies to one case only, namely an
out-of-place transform on the general domain where the representation
is “RR.”

SUBROUTINE DFTP_DFT_{Forward,Reverse}(My_Desc, X_re, X_im, Y_re, Y_im)
TYPE (DFTP_Desc), INTENT(INOUT), POINTER :: My_Desc
REAL, INTENT(IN), DIMENSION(0:*) :: X_re, X_im
REAL, INTENT(OUT), DIMENSION(0:*) :: Y_re, Y_im

There is a corresponding routine for double precision as well. Finally,
the third form of the computation routines has one parameter following
the descriptor. These are all in-place routines.

SUBROUTINE DFTP_DFT_{Forward,Reverse}(My_Desc, X)

The parameters can have the attributes listed in Table 4.

4.5. INQUIRY ROUTINES 

Finally, the module provides a set of inquiry routines in the form
of functions. One function is provided to return the error code. An
error number of 0 implies normal completion. The meaning of a non-zero
error number can be implementation dependent. It is also possible that a
small set of error conditions with associated non-zero error numbers can
be made standard. Corresponding to each option is an inquiry function
that returns the current setting. The functions all take one input
parameter, which is the pointer to the current descriptor. Each function
returns either an integer scalar or, where an array is needed for the
return information, a pointer to an integer array.

Table 4 Attributes of in-place routines

Type                          Intent
Complex (or double complex)   Inout
Real (or double precision)    Inout

Table 5 Description of inquiry routines

Function Name                 Return Value
DFTP_Inquire_ErrorNum         Integer scalar
DFTP_Inquire_Dimension        Integer scalar
DFTP_Inquire_Periods          pointer
DFTP_Inquire_Precision        scalar, DFTPACK parameters
DFTP_Inquire_NumTransforms    Integer scalar
DFTP_Inquire_Domain           scalar, DFTPACK parameters
DFTP_Inquire_Rep_General      scalar, DFTPACK parameters
DFTP_Inquire_Rep_Real         scalar, DFTPACK parameters
DFTP_Inquire_Rep_ConEven      scalar, DFTPACK parameters
DFTP_Inquire_Stride_Input     pointer
DFTP_Inquire_Stride_Output    pointer
DFTP_Inquire_Workspace        scalar, DFTPACK parameters
DFTP_Inquire_Ordering         scalar, DFTPACK parameters
DFTP_Inquire_Digit_Reversal   pointer
DFTP_Inquire_Transposition    scalar, DFTPACK parameters
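
To make the descriptor life cycle concrete, the following C sketch mimics
the define, change, and inquire steps. It is an illustration under stated
assumptions, not the C binding of the proposal: all names (dft_desc,
dft_define_desc, dft_change_scale, dft_change_ordering, dft_inquire_*,
dft_free_desc) and the use of error number 1 for an invalid option are
hypothetical, and the computation routines are omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical ordering options (ordered/scrambled input and output). */
enum { DFT_ORDERING_OO, DFT_ORDERING_OS, DFT_ORDERING_SO, DFT_ORDERING_SS };

/* Hypothetical descriptor: every setting lives here, so a computation
   routine would need only the descriptor and the data. */
typedef struct {
    int dimension;      /* number of dimensions */
    int *periods;       /* one period per dimension */
    double scale;       /* default: one */
    int ordering;       /* default: ordered input, ordered output */
    int error_num;      /* 0 means normal completion */
} dft_desc;

/* Definition routine: allocates a descriptor holding the defaults. */
dft_desc *dft_define_desc(int dimension, const int *periods)
{
    dft_desc *d;
    if (dimension < 1 || periods == NULL) return NULL;
    d = malloc(sizeof *d);
    if (d == NULL) return NULL;
    d->periods = malloc((size_t)dimension * sizeof *d->periods);
    if (d->periods == NULL) { free(d); return NULL; }
    memcpy(d->periods, periods, (size_t)dimension * sizeof *d->periods);
    d->dimension = dimension;
    d->scale = 1.0;
    d->ordering = DFT_ORDERING_OO;
    d->error_num = 0;
    return d;
}

/* Option changing routines: each one only updates the descriptor. */
void dft_change_scale(dft_desc *d, double new_scale)
{
    assert(d != NULL);
    d->scale = new_scale;
}

void dft_change_ordering(dft_desc *d, int new_ordering)
{
    assert(d != NULL);
    if (new_ordering < DFT_ORDERING_OO || new_ordering > DFT_ORDERING_SS)
        d->error_num = 1;   /* assumed code for "invalid option" */
    else
        d->ordering = new_ordering;
}

/* Inquiry routines: a scalar, or a pointer when an array is needed. */
double dft_inquire_scale(const dft_desc *d) { return d->scale; }
int dft_inquire_ordering(const dft_desc *d) { return d->ordering; }
int dft_inquire_error_num(const dft_desc *d) { return d->error_num; }
const int *dft_inquire_periods(const dft_desc *d) { return d->periods; }

void dft_free_desc(dft_desc *d)
{
    if (d != NULL) { free(d->periods); free(d); }
}
```

In this style a caller defines a descriptor once, adjusts only the options
that differ from the defaults, and can check the recorded error number
afterwards instead of testing a status on every call.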

5. CONCLUSION 

We believe the proposed API achieves all the goals set out in Section 3.
Our proposal is formulated in Fortran 90 together with a C binding that
has an almost one-to-one correspondence with the Fortran counterpart.
While the vast functionality supported requires a substantial
implementation effort, the proposed options can be supported
incrementally. A reference implementation effort is underway.

DISCUSSION

Speaker: Ping Tak Peter Tang 

Robert van de Geijn : It seems to me that there really should be two 
descriptors. The first should describe the data, including length, storage 
method, and pointer to the actual data. The second should describe the 
operation to be performed. One benefit would be that one could take 
“views” of the data, which would clarify implementation. 

Ping Tak Peter Tang : While further dividing the attributes of the
problem and encapsulating them in more than one descriptor in principle
allows for flexibility that may facilitate unforeseen features that may
be desired, the Discrete Fourier Transform is simple compared to, say,
linear algebra, where many different operations (LU, SVD, Cholesky,
etc.) might be applied to the same data. Here we only have one
operation, namely, the DFT. Doubtless, different algorithms will be used
to effect the transform depending on the properties of the data, but it
is best to hide that from the user.

Fred Gustavson : I applaud you for producing an API for the FFT.
You have produced a very good design. Bernard Rudin, from IBM, and
I suggested to Charles Van Loan the need for such an interface in 1995.
Charles agreed. However, our effort did not materialize in an interface.

Ping Tak Peter Tang : Thank you for the encouragement. I too feel
strongly about the need for a common API. I will rely on colleagues like
you to give me moral support and technical advice as I, and possibly
other collaborators, proceed on this project.

Anne Trefethen : Having a common API for DFT is important. Together
with the BLAS, DFTs are functionality often provided by hardware
vendors. At NAG, we interface to the vendor routines when available.
How do we get the hardware vendors to buy into this common API?

Ping Tak Peter Tang : I plan to pursue two routes. First, a portable
and public domain implementation of the API must be made available.
This implementation must have very reasonable performance compared
to native implementations. Moreover, the software architecture
must consist of components that vendors can replace with their
existing routines with minimal effort. This will allow a vendor to take
the reference implementation that supplies all the functionality and
improve performance on the subset that the vendor has already optimized.
Second, I intend to produce, and put into open source, Intel-specific
implementations for both the 32-bit and 64-bit architectures.

Richard Hanson : The use of a descriptor hides certain initialization
steps. These include computing the twiddle factors (complex
multipliers). In threaded environments it seems that an initialization
call to the descriptor logic will be necessary.

Ping Tak Peter Tang : I am just beginning on some implementation 
effort. I believe the API design is sound and will allow correct and 
thread-safe implementation without performance penalty. I agree that 
we must have a thread-safe implementation. 

Abstract Pthreads is the library of POSIX standard functions for concurrent,
multithreaded programming. The POSIX standard only defines an ap- 
plication programming interface (API) to the C programming language, 
not to Fortran. Many scientific and engineering applications are written 
in Fortran. Also, many of these applications exhibit functional, or task- 
level, concurrency. They would benefit from multithreading, especially 
on symmetric multiprocessors (SMP). We summarize here an interface 
to that part of the Pthreads library that is compatible with standard 
Fortran. The contribution consists of two primary source files: a For- 
tran module and a collection of C wrappers to Pthreads functions. The 
Fortran module defines the data structures, interface and initialization 
routines used to manage threads. The stability and portability of the 
Fortran API to Pthreads is demonstrated using common mathematical 
computations on three different systems. 

This paper is a shortened and slightly modified version of a complete 
Algorithm submitted for publication to the journal ACM Trans. Math. 
Software, during July, 2000. 

Keywords: POSIX Threads, Fortran, scientific computing, symmetric multiproces- 
sor, mathematical software, barrier routine 

1. INTRODUCTION 

Pthreads is a POSIX standard library [6] for expressing concurrency
on single processor computers and symmetric multiprocessors (SMPs).
Typical multithreaded applications include operating systems, database
search and manipulation, and other transaction-based systems with
shared data. These programs are generally coded in C or C++. Hence
the POSIX standard only defines a C interface to Pthreads. The lack of
a Fortran interface has limited the use of Pthreads for scientific and
numerically intensive applications. However, since many scientific
computations contain opportunities for exploiting functional, or
task-level, concurrency, certain Fortran applications would benefit from
multithreading.

A thread represents an instruction stream executing within a single 
address space; multiple threads, of the same process, share this address 
space. Threads are sometimes called ‘lightweight’ processes because they 
share many of the properties and attributes of full processes but re- 
quire minimal system resources to maintain. When an operating system 
switches context between processes, the entire memory space of the exe- 
cuting process must be saved and the memory space of the process sched- 
uled for execution must be restored. When switching context between 
threads there is no need to save and restore large portions of memory 
because the threads are executing within the same memory space. This 
savings of system resources is a major advantage of using threads. 

The Pthreads library provides a means to control the spawning, ex- 
ecution, and termination of multiple threads within a single process. 
Concurrent tasks are mapped to threads. Threads within the same
process have access to their own local, private memory but also share
the memory space of the global process. On SMPs, the system may execute
threaded tasks in parallel.
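
As an illustration of spawning and joining threads through the C
interface, the sketch below divides a summation among a fixed number of
threads. Only pthread_create() and pthread_join() come from the Pthreads
library; the names parallel_sum, sum_slice, struct slice, and NTHREADS are
assumptions made for this example. Each thread keeps its partial sum in
its own private slot while reading the shared input array.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

/* Work description for one thread: a slice of the shared array. */
struct slice {
    const double *data;   /* shared, read-only input */
    int begin, end;       /* half-open index range [begin, end) */
    double sum;           /* private result written by the thread */
};

/* Thread start routine: sums one slice. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->sum = 0.0;
    for (int i = s->begin; i < s->end; i++)
        s->sum += s->data[i];
    return NULL;
}

/* Sums data[0..n-1] with NTHREADS threads. Returns 0 on success or the
   error code of the pthread_create() call that failed. */
int parallel_sum(const double *data, int n, double *total)
{
    pthread_t tid[NTHREADS];
    struct slice part[NTHREADS];
    int created = 0, rc = 0;

    assert(data != NULL && total != NULL && n >= 0);
    for (int t = 0; t < NTHREADS; t++) {
        part[t].data = data;
        part[t].begin = (int)((long long)n * t / NTHREADS);
        part[t].end = (int)((long long)n * (t + 1) / NTHREADS);
        rc = pthread_create(&tid[t], NULL, sum_slice, &part[t]);
        if (rc != 0) break;
        created++;
    }
    /* Join every thread that was started before reading its result. */
    *total = 0.0;
    for (int t = 0; t < created; t++) {
        pthread_join(tid[t], NULL);
        *total += part[t].sum;
    }
    return rc;
}
```

Because each thread writes only its own slot, no locking is needed here;
the joins guarantee that the partial sums are complete before they are
combined.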

As useful as the Pthreads standard is for concurrent programming, a 
Fortran interface is not defined. The POSIX 1003.9 (FORTRAN Lan- 
guage) committee was tasked with creating a FORTRAN (77) definition 
to the base POSIX 1003.1-1990 standard [3]. There is no evidence of 
any POSIX standard work to produce a FORTRAN equivalent to the 
Pthread standard. Fortran 90 has addressed many shortcomings of FOR- 
TRAN 77 that may have prevented the formulation of such a standard. 
There are no serious technical barriers to implementing a workable API 
in Fortran 90. 

We review the implementation and testing of a Fortran API to
Pthreads (referred to in what follows as FPTHRD). Our tests indicate
that the API is standard-conforming with Fortran 90 and Fortran 95
compilers. For this reason we use “Fortran” to mean compliance with
both standards. The following section gives some general information
on the threaded programming model with a specific example taken from
the POSIX library functions. More complete descriptions of the POSIX
thread library can be found in [2], [7], and [8].

In the complete paper [5] details of implementation of the Fortran API
package to the Pthreads library are provided. Benchmarks are presented 
for threaded example problems and a comparison of their execution per- 
formance on three separate SMPs, each with a different native Pthreads 
implementation. These results show that thread programming has merit 
in terms of improving single processor performance on typical scientific 
computations. 

2. THREADED PROGRAMMING 
CONCEPTS 

Multithreading is a concurrent programming model. Threads may 
execute concurrently on a uniprocessor system. Parallel execution, how- 
ever, requires multiple processors sharing the same memory; i.e., SMP 
platforms. 

Threads perform concurrent execution at the task or function level. A 
single process composed of independent tasks may divide these
computations into a set of concurrently executing threads. Each thread
is an instruction stream, with its own stack, sharing the global memory
space assigned to the process. Upon completion, the thread’s resources
are recovered by the system. All POSIX threads executing within a process
are peers. Thus, any thread may cancel any other thread; any thread 
may wait for the completion of any other thread; and any thread may 
block any other thread from executing protected code segments. There is 
no explicit parent-child relationship unless the programmer specifically 
implements such an association. 

With separate threads executing within the same memory address 
space, there is the potential for memory access conflicts; i.e., write/write 
and read/write conflicts (also known as race conditions). Write/write 
conflicts arise when multiple threads attempt to concurrently write to 
the same memory location; read/write conflicts arise when one thread is
reading a memory location while another thread is concurrently writing 
to that same memory location. Since scheduling of threads is largely non- 
deterministic, the order of thread operations may differ from one execu- 
tion to the next. It is the responsibility of the programmer to recognize 
potential race conditions and control them. Fortunately, Pthreads pro- 
vides a mechanism to control access to shared, modifiable data. Locks, 
in the form of mutual exclusion (mutex) variables, prevent threads from 
entering critical regions of the program while the lock is held by another 
thread. Threads attempting to acquire a lock (i.e., enter a protected 
code region) will wait if another thread is already in the protected region. 
Threads acquire and release locks using function calls to the Pthreads
library.
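
A minimal C sketch of a mutex protecting a shared counter follows. The
names counter, counter_lock, add_many, and count_with_two_threads are
chosen for this example only; the Pthreads calls themselves are standard.
Without the lock, the two threads could interleave their read, increment,
and write steps and lose updates (a write/write conflict).

```c
#include <assert.h>
#include <pthread.h>

static long counter = 0;   /* shared, modifiable data */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread start routine: adds *arg to the counter, one step at a time. */
static void *add_many(void *arg)
{
    long n = *(const long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&counter_lock);     /* enter critical region */
        counter++;
        pthread_mutex_unlock(&counter_lock);   /* leave critical region */
    }
    return NULL;
}

/* Runs two threads that each add n to the counter. Returns the final
   value, or -1 if a thread could not be created. */
long count_with_two_threads(long n)
{
    pthread_t a, b;

    assert(n >= 0);
    counter = 0;
    if (pthread_create(&a, NULL, add_many, &n) != 0)
        return -1;
    if (pthread_create(&b, NULL, add_many, &n) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

Locking around every increment is deliberately fine-grained to show the
mechanism; a real computation would normally accumulate locally and take
the lock once per thread.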

Pthreads provides an additional form of synchronization through con- 
dition variables. Threads may pause execution until they receive a signal 
from another thread that a particular condition has been met. Waiting 
and signaling is done using Pthreads function calls. 
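
The waiting and signaling pattern can be sketched in C as follows, assuming
a hypothetical one-slot mailbox (struct mailbox, producer, and
wait_for_value are names invented for this example). The waiting thread
rechecks its predicate in a loop because pthread_cond_wait() may return
without the condition having been signaled.

```c
#include <assert.h>
#include <pthread.h>

/* One-slot mailbox guarded by a mutex and a condition variable. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t ready_cv;
    int ready;    /* predicate: has a value been deposited? */
    int value;
};

/* Thread start routine: deposits a value and signals the waiter. */
static void *producer(void *arg)
{
    struct mailbox *m = arg;
    pthread_mutex_lock(&m->lock);
    m->value = 42;
    m->ready = 1;
    pthread_cond_signal(&m->ready_cv);
    pthread_mutex_unlock(&m->lock);
    return NULL;
}

/* Starts a producer thread and waits until it has deposited a value.
   Returns the value, or -1 if the thread could not be created. */
int wait_for_value(void)
{
    struct mailbox m;
    pthread_t tid;
    int value = -1;

    pthread_mutex_init(&m.lock, NULL);
    pthread_cond_init(&m.ready_cv, NULL);
    m.ready = 0;
    m.value = 0;

    if (pthread_create(&tid, NULL, producer, &m) == 0) {
        pthread_mutex_lock(&m.lock);
        while (!m.ready)                       /* guard against spurious wakeups */
            pthread_cond_wait(&m.ready_cv, &m.lock);
        assert(m.ready == 1);
        value = m.value;
        pthread_mutex_unlock(&m.lock);
        pthread_join(tid, NULL);
    }
    pthread_cond_destroy(&m.ready_cv);
    pthread_mutex_destroy(&m.lock);
    return value;
}
```

The mutex is held whenever the predicate is read or written, so the signal
cannot slip in between the test and the wait.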

2.1. POSIX CONSIDERATIONS 

The Pthreads header file for the C wrapper code (summary.h) con- 
tains system dependent definitions for data structures that are used with 
the Pthreads routines. These are typically C structures and are intended 
to be opaque to the programmer. Manipulation of and access to the 
contents of the structures should only be done through calls to the ap- 
propriate Pthreads functions. Since programmers do not need to deal 
with differences of structure definitions between platforms, this style en- 
ables codes to be portable. Standard names for error codes that can 
be returned from system calls are established by the POSIX standard. 
Integer valued constants are defined with these standard names within
system header files. As with Pthreads structures, the actual value of 
any given error code constant may change from one operating system 
to the next. The intention is to keep the specific values given to each 
error code hidden from the programmer. Thus, the programmer need 
only compare a function’s return value against the named constant to 
determine if a specific error condition has arisen. 
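
For example, a C caller can test a return value against the named
constant EBUSY without knowing its numeric value. The helper try_enter
below is a hypothetical name used only for this sketch;
pthread_mutex_trylock() and EBUSY are standard.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Tries to acquire the lock without blocking.
   Returns 1 if the lock was acquired, 0 if it is already held (EBUSY),
   and -1 for any other error condition. */
int try_enter(pthread_mutex_t *m)
{
    int rc;

    assert(m != NULL);
    rc = pthread_mutex_trylock(m);
    if (rc == 0)
        return 1;
    if (rc == EBUSY)     /* compare against the name, never a literal */
        return 0;
    return -1;
}
```

The same source compiles unchanged on platforms that assign different
integers to EBUSY, which is the portability benefit described above.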

3. DESIGN AND IMPLEMENTATION OF 
FORTRAN API 

The Pthreads library is relatively small, consisting of only 61 routines 
that can loosely be classified into three categories: thread management, 
thread synchronization, and thread attributes control. Thread manage- 
ment functions deal with the creation, termination, and other manipula- 
tion of threads. The two methods available for guaranteeing the correct 
and synchronous execution of concurrent threads are mutex and con- 
dition variables. These constructs, and the functions to handle them, 
are used to ensure the integrity of data being shared by threads. The 
Pthreads standard defines attributes in order to control the execution 
characteristics of threads. Such attributes include detach state, stack 
address, stack size, scheduling policies, and execution priorities. 

Table 1 Example code: Use of static initializers

TYPE (fpthrd_mutex_t) any_mutex
any_mutex = FPTHRD_MUTEX_INITIALIZER



3.1. FORTRAN INTERFACE DETAILS 

The FPTHRD package consists of a Fortran module and a file of C
routines. The module defines Fortran derived types, parameters,
interfaces, and routines to allow Fortran programmers to use Pthreads
routines. The C functions provide the interface from Fortran subroutine
calls and map parameters into the corresponding POSIX routines and 
function arguments. Dependencies on compiler and Pthreads versions 
are managed within the C functions. 

The following sections describe some of the design decisions we 
faced and the similarities and differences between FPTHRD and the 
Pthreads standard. 

3.1.1 Naming Conventions. The names of the FPTHRD
routines are derived from the Pthreads root names; i.e., the string
following the prefix pthread_. The string fpthrd_ replaces this prefix. In
this way, a call to the Pthreads function pthread_create() translates to
a call to the Fortran subroutine fpthrd_create(). Our initial thought
was to prefix the full POSIX names with the character f, which would
yield the prefix string fpthread_ before each root name. However, the
Fortran standard [1] limits subroutine and variable names to 31
characters. The longest POSIX defined name is 32 characters in length.
Since the fpthrd_ prefix yields a net loss of one character over the
POSIX prefix, we can guarantee that routine names in our package will
have no more than 31 characters. All the Fortran routine names are
therefore standard-compliant and all the Pthreads root names remain intact.
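
The prefix substitution and the length argument can be checked
mechanically. The C sketch below is not part of FPTHRD; the function
fpthrd_name and the macro FORTRAN_NAME_MAX are hypothetical names for this
example. It rewrites a pthread_ name with the fpthrd_ prefix and rejects
any result longer than 31 characters. One 32-character Pthreads name,
pthread_mutexattr_setprioceiling, maps to exactly 31 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FORTRAN_NAME_MAX 31   /* name length limit cited in the text */

/* Writes the fpthrd_ form of a pthread_ name into out.
   Returns the length of the new name, or -1 if the input lacks the
   pthread_ prefix, the buffer is too small, or the result is too long. */
int fpthrd_name(const char *posix_name, char *out, size_t out_size)
{
    static const char old_prefix[] = "pthread_";
    static const char new_prefix[] = "fpthrd_";
    const size_t old_len = sizeof old_prefix - 1;
    int n;

    assert(posix_name != NULL && out != NULL);
    if (strncmp(posix_name, old_prefix, old_len) != 0)
        return -1;
    /* Keep the root name intact and swap only the prefix. */
    n = snprintf(out, out_size, "%s%s", new_prefix, posix_name + old_len);
    if (n < 0 || (size_t)n >= out_size || n > FORTRAN_NAME_MAX)
        return -1;
    return n;
}
```

For instance, pthread_create becomes fpthrd_create (13 characters), while
a name that does not start with pthread_ is left for separate handling.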

For consistency, all POSIX data types (Table 2 below) and defined
constants (Table 2, [5]) prefixed with pthread_ (PTHREAD_) are
defined with the prefix fpthrd_ (FPTHRD_) within the Fortran module.
Besides those defined specifically for Pthreads types, other POSIX types
are used as parameters to Pthreads functions. For these additional
structures a corresponding definition is included within the module
with the prefix character f added to the POSIX name.

Table 2 Fortran derived types and Pthreads structures

Fortran Derived Type Name      POSIX Structure Name
TYPE (fpthrd_t)                pthread_t
TYPE (fpthrd_once_t)           pthread_once_t
TYPE (fpthrd_attr_t)           pthread_attr_t
TYPE (fpthrd_mutex_t)          pthread_mutex_t
TYPE (fpthrd_mutexattr_t)      pthread_mutexattr_t
TYPE (fpthrd_cond_t)           pthread_cond_t
TYPE (fpthrd_condattr_t)       pthread_condattr_t
TYPE (fsched_param)            sched_param
TYPE (ftimespec)               timespec
TYPE (fsize_t)                 size_t

3.1.2 Structure Initialization. Besides the routines specifi- 
cally designed for initialization, the Pthreads library includes predefined 
constants that can be used to initialize mutexes, condition variables, 
and ‘once block’ structures to their default values. Corresponding de- 
rived type constants for initialization have been defined and included in 
FPTHRD. The type and names of these initialization constants for a 
condition variable, mutex variable, and ‘once block’ variable are: 

TYPE (fpthrd_cond_t) FPTHRD_COND_INITIALIZER 
TYPE (fpthrd_mutex_t) FPTHRD_MUTEX_INITIALIZER 
TYPE (fpthrd_once_t) FPTHRD_ONCE_INIT 

To use these initialized data types with default attributes, assign the
value in a program unit with the assignment operator, as in Table 1.

3.1.3 Parameters. The Fortran API preserves the order of the 
arguments of the C functions and provides the C function value as the 
final argument. This style of using Fortran subroutines for corresponding 
C functions with the status argument appended to the parameter list is 
used in the Fortran API for both MPI [9] and PVM [4]. A return value of 
zero indicates that the routine call did not yield any exception; any non- 
zero return value indicates that an exception condition occurred during 
execution. Whether an exception condition is an error or can be ignored 
is determined in the context of the application. The POSIX standard
defines names for specific conditions and requires that fixed integer
values be attached to these error codes. The Fortran module defines
integer constants with the same names as the POSIX standard for all
potential error codes that might be returned from Pthreads functions.
The values of these Fortran constants are the same as their POSIX
counterparts on the target platform. The routines fpthrd_self() and
fpthrd_equal() have no status argument since they do not return
exception flags.

Fortran provides compile-time checking of argument type, number, 
kind, and rank using interface blocks. This is an advantage over the 
C programming language, which does not provide argument checking. 
Besides the compile-time checking, interface blocks also provide for argu- 
ment overloading. This feature allows the use of TYPE(C_NULL) pa- 
rameters where an optional NULL could be used in the underlying C func- 
tions. Fortran interface blocks also make it possible for the status param- 
eter to be optional in Fortran routine calls. The module in our package 
provides interface blocks for the Fortran routines that call corresponding 
C functions with the exception of routine fpthrd_create(). Since the 
argument type for the threaded subroutine is chosen by the program 
author, it is necessary to exclude type checking for fpthrd_create(). 
The status parameter is not optional in calls to this routine. 

3.1.4 fpthrd_join() Parameters. One special case should
be mentioned with respect to parameters. The second parameter of
fpthrd_join() is used to return an exit code from the fpthrd_exit()
call of the thread being joined. The Pthreads library uses a C language
void** type to allow the return of any defined data value or C
structure. If no value is expected or needed by the joining thread, a
NULL value may be used. Due to the difference in the way C and Fortran
pointers are implemented (see 3.4 What’s Not Included in the Package,
for further discussion) and the desire to keep the programming of the
interface as simple as possible, it was decided to restrict the type of
this parameter to INTEGER.

This type restriction is repeated in the single parameter of the 
fpthrd_exit() routine which generates the value. Within scientific ap- 
plications, it was thought that this exit value would be used mostly for 
returning a completion code to the joining thread. Special codes could 
be designed to signify the success or failure, and the cause of any failure, 
of the joined thread. Should more elaborate data structures be required 
to be passed from a thread to that thread which joins it, the integer 
value can be used as a unique index into a global array of results. 
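
The idea of returning an integer that indexes a global array of results
can be sketched on the C side as follows. This is an illustration rather
than the FPTHRD wrapper code: results, worker, run_workers, and NWORKERS
are hypothetical names, and the integer is carried through the void*
exit value by conversion to and from intptr_t.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NWORKERS 3

/* Global array of results, one slot per worker. */
static double results[NWORKERS];

/* Thread start routine: stores its result and exits with its slot index.
   Returning from the start routine is equivalent to calling
   pthread_exit() with the same value. */
static void *worker(void *arg)
{
    intptr_t slot = (intptr_t)arg;
    results[slot] = 10.0 * (double)(slot + 1);
    return (void *)slot;
}

/* Spawns the workers, joins them, and uses each integer exit code as an
   index into the results array. Returns the total, or -1.0 if some
   thread could not be created. */
double run_workers(void)
{
    pthread_t tid[NWORKERS];
    double total = 0.0;
    int created = 0;

    for (intptr_t i = 0; i < NWORKERS; i++) {
        if (pthread_create(&tid[i], NULL, worker, (void *)i) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++) {
        void *code = NULL;
        intptr_t slot;
        pthread_join(tid[i], &code);
        slot = (intptr_t)code;
        assert(slot >= 0 && slot < NWORKERS);
        total += results[slot];
    }
    return created == NWORKERS ? total : -1.0;
}
```

Each worker writes a distinct slot and the joining thread reads it only
after the join, so no lock is required for the results array.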

3.2. SUPPORT AND UTILITY ROUTINES

This section contains details on several routines that are not in- 
cluded in the Pthreads standard. These routines have been included in 
FPTHRD to provide the programmer the ability to give the runtime 
system a hint as to the number of active threads desired, to initialize 
the Fortran API routines and check parameter values and derived type 
sizes, and to manipulate POSIX defined data types required by certain 
Pthreads functions for which there is no Fortran compliant method gen- 
erally available. 

Many systems that support multithreading have an included func- 
tion to inform the runtime system of the number of threads the sys- 
tem should execute concurrently. This seems to be particularly rel- 
evant for uniprocessor systems and is intended to allow finer control 
of system resources by the thread programmer. We have included 
the fpthrd_setconcurrency() and fpthrd_getconcurrency() sub- 
routines in FPTHRD to give the programmer a chance to request the 
number of kernel entities that should be devoted to threads; i.e., the 
number of threads to be executing concurrently. If the target platform 
does not support this functionality, calls to these routines will return 
without altering anything. 

An initial data exchange is required as a first program step before
using other routines in FPTHRD. The routine fpthrd_data_exchange()
is used as an initialization for the FPTHRD library. This routine is
similar in functionality to the MPI_INIT() routine from MPI. The data
exchange was found to be necessary because the parameters defined in
Fortran or constants defined in C are not directly accessible in the
alternate language. One such value of note is the parameter NULL passed
from Fortran to C routines. This integer is used as a signal within the
C wrapper code to substitute a NULL pointer for the corresponding
function argument. The derived type TYPE(C_NULL), while available to
programmers, is not meant for use except to define the special
parameter value NULL.

The working space for the C structures of Pthreads data types is
declared as Fortran derived types. Each of the definitions for derived
types is an integer array with the PRIVATE attribute. Pthreads structures
are opaque. The PRIVATE attribute prevents the Fortran program 
from inadvertently accessing the data in these structures. One other 
task performed by the fpthrd_data_exchange() routine is to ensure 
that the Fortran derived types are of sufficient size to hold the corre- 
sponding C structures. 
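
A size check of this kind might look as follows on the C side. This is a
sketch under assumptions, not the FPTHRD implementation: the function
names words_needed, storage_is_sufficient, and check_pthread_sizes are
hypothetical, and the Fortran side is assumed to reserve the same number
of integer words for each opaque type.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Number of words of word_size bytes needed to hold c_size bytes. */
size_t words_needed(size_t c_size, size_t word_size)
{
    assert(word_size > 0);
    return (c_size + word_size - 1) / word_size;   /* round up */
}

/* Nonzero if an array of `words` words can hold a C object of c_size bytes. */
int storage_is_sufficient(size_t words, size_t word_size, size_t c_size)
{
    return words >= words_needed(c_size, word_size);
}

/* Checks a Fortran-side reservation of `words` integer words against the
   actual sizes of three opaque Pthreads types on this platform. */
int check_pthread_sizes(size_t words, size_t word_size)
{
    return storage_is_sufficient(words, word_size, sizeof(pthread_mutex_t))
        && storage_is_sufficient(words, word_size, sizeof(pthread_cond_t))
        && storage_is_sufficient(words, word_size, sizeof(pthread_attr_t));
}
```

Because the sizes are obtained with sizeof at compile time on the target
platform, the check needs no knowledge of how each structure is laid out.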

Table 3 Example code: Use of error codes

DO
   CALL fpthrd_create(tid, NULL, thread_routine, routine_arg, ierr)
   IF (ierr /= EAGAIN) EXIT
   CALL wait_some_random_time
END DO



Five additional routines are included to give the programmer the
ability to manipulate those C structures used by Pthreads that are
not a direct part of the Pthreads definition. The Fortran names
defined in FPTHRD for these data types are TYPE(fsize_t),
TYPE(ftimespec), and TYPE(fsched_param) (as shown in Table 2).
For these data types there are routines to set and retrieve values
from objects of each type.

3.3. ERROR CHECKING 

The POSIX standard defines a set of error codes that may be returned
from calls to Pthreads functions to signal that exceptional conditions
have occurred. These exception codes are available from the routines
in FPTHRD through the optional status parameter. Examination of
the returned value of the status parameter allows codes to react
dynamically to possibly fatal conditions that may arise during execution.

As an example, consider a code that requires the creation of a large 
number of threads. During execution, resources may be temporarily 
unavailable to create new threads. Rather than abort the entire com- 
putation at this exception, it would be prudent to pause the creation of 
new threads until resources become available, since this is deemed
certain to happen. If the fpthrd_create() status parameter
return value is equal to the EAGAIN error constant, the spawning thread
would wait for some amount of time before attempting to create another 
thread (Table 3). As long as the EAGAIN exception value is returned from 
fpthrd_create(), the spawning thread will continue to wait before at- 
tempting to create the new thread. 
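
The retry pattern of Table 3 can also be sketched outside Fortran. The Python fragment below is only an illustration and is not part of FPTHRD: create_thread is a hypothetical callable standing in for fpthrd_create() that returns an errno-style status, and the bound on the number of attempts and the randomized pause are assumptions added here so that the loop cannot spin forever if resources never become available.

```python
import errno
import random
import time


def create_with_retry(create_thread, max_attempts=10, max_pause=0.01,
                      sleep=time.sleep):
    """Call create_thread() until it stops reporting EAGAIN.

    create_thread is a hypothetical stand-in for fpthrd_create(): it takes
    no arguments and returns 0 on success or an errno-style status code.
    Returns (status, attempts); status is EAGAIN if every attempt failed.
    """
    status = errno.EAGAIN
    for attempt in range(1, max_attempts + 1):
        status = create_thread()
        if status != errno.EAGAIN:
            # Success (0) or a different error: stop retrying either way.
            return status, attempt
        # Resources temporarily unavailable: pause for a random interval.
        sleep(random.uniform(0.0, max_pause))
    return status, max_attempts
```

Any status other than EAGAIN is handed back to the caller unchanged, which matches the advice in the text that only this one exception is treated as temporary.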

While each platform may have different values for EAGAIN and all 
other error constants, the initial data exchange routine accounts for these 
differences. All the programmer needs to do is use the symbolic name; 
e.g., EAGAIN. The possible error constants that may be returned from 
each routine in FPTHRD are detailed in the documentation for each 
Pthreads routine. 

Since most routines in FPTHRD have several possible exception 
codes, rather than specifically check for each one, a method to print out 
what exception code was returned may be desired. This is especially 
true when debugging threaded applications. FPTHRD contains the 
routine ferr_abort(), which provides the functionality described above 
as well as aborting further processing by all threads. A brief description 
of the ferr_abort() subroutine and its parameters is given below. 

SUBROUTINE ferr_abort(sequence_number, status, text_string)
   INTEGER, INTENT(IN) :: sequence_number
   INTEGER, INTENT(IN) :: status
   CHARACTER, DIMENSION(*), INTENT(IN) :: text_string

The sequence number is an arbitrary identifying integer printed with 
the error message. The status argument is a variable holding the excep- 
tion code value returned from a prior call to some routine in FPTHRD. 
If the status value is non-zero, a message containing the corresponding 
error constant is printed along with the text of the third argument. A 
Fortran STOP 'Abort' is also executed to terminate the computation. 
If the status value is zero, no action is taken by the ferr_abort() 
routine. Thus, it is safe (and very wise) to insert calls to ferr_abort() 
after calls to FPTHRD routines when fatal errors are possible. Where it is 
possible that non-fatal exceptions may be encountered, these should be 
dealt with directly by the application code. 
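
The documented behaviour of ferr_abort() can be mimicked in a few lines. The following Python sketch is an illustration under stated assumptions, not the FPTHRD implementation: the name err_abort and the message layout are hypothetical, errno.errorcode supplies the symbolic constant, and SystemExit stands in for the Fortran STOP (in Python it ends only the calling thread when raised outside the main thread, whereas ferr_abort() stops all threads).

```python
import errno
import sys


def err_abort(sequence_number, status, text_string, out=sys.stderr):
    """Mimic the documented behaviour of ferr_abort() (illustration only).

    A zero status is a no-op. A non-zero status prints the sequence number,
    the symbolic name of the error constant, and the caller's text, and
    then terminates by raising SystemExit.
    """
    if status == 0:
        return
    # errno.errorcode maps numeric codes to symbolic names such as 'EINVAL'.
    name = errno.errorcode.get(status, "unknown error %d" % status)
    print("%d: %s: %s" % (sequence_number, name, text_string), file=out)
    # SystemExit plays the role of the Fortran STOP statement here.
    raise SystemExit("Abort")
```

Because a zero status does nothing, a call to such a routine can follow every library call where a failure would be fatal, as the text recommends.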

3.4. WHAT’S NOT INCLUDED IN THE 
PACKAGE 

The functionality of several routines included in the Pthreads library 
is outside the scope of Fortran. We describe these functions in this 
section and state our reasons for their exclusion from FPTHRD. 

The functions pthread_cleanup_push() and pthread_cleanup_pop() allow 
the programmer to push function calls onto, and remove them from, a 
stack structure. Should a thread be cancelled before the corresponding 
pop calls have been executed, the functions are removed from the stack 
and executed. In this way, threads are able to “clean up” details such 
as freeing allocated memory or releasing acquired mutex variables even 
if normal termination is thwarted. 
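
The push and pop behaviour just described can be illustrated with a small Python sketch. It is only an analogy, not Pthreads or FPTHRD code: the class CleanupStack and the function worker are hypothetical names, and cancellation is simulated by raising an exception.

```python
class CleanupStack:
    """Stack of cleanup handlers, loosely modelled on
    pthread_cleanup_push() and pthread_cleanup_pop()."""

    def __init__(self):
        self._handlers = []

    def push(self, func, arg):
        self._handlers.append((func, arg))

    def pop(self, execute):
        # Remove the most recent handler; run it only if requested.
        func, arg = self._handlers.pop()
        if execute:
            func(arg)

    def unwind(self):
        # On early termination the remaining handlers run in LIFO order.
        while self._handlers:
            self.pop(True)


def worker(log, cancel_early):
    stack = CleanupStack()
    try:
        stack.push(log.append, "release mutex")
        stack.push(log.append, "free buffer")
        if cancel_early:
            raise RuntimeError("simulated cancellation")
        stack.pop(False)   # normal path: handlers removed without running
        stack.pop(False)
    except RuntimeError:
        pass
    finally:
        stack.unwind()     # runs whatever is still on the stack
```

When the simulated cancellation occurs, the handlers run in the reverse of the order in which they were pushed; on the normal path the paired pop calls leave nothing to run.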

While the functionality of these routines is desirable, they are typically 
implemented as macros defined in the pthread.h header file in order to 
ensure paired push and pop calls. Upon further examination, we have 
found undocumented system calls and data structures used within these 
macros. Since the targets for FPTHRD are scientific computation and 
numerical codes, it was concluded that such functionality might not be 
as useful as other functions. With that in mind, it was decided the effort 
required to develop a simple, general algorithm to implement equivalent 
cleanup functions in Fortran outweighed the potential benefit. 

The heart of the problem for the functions associated with 
thread-specific data, pthread_getspecific(), pthread_key_create(), 
pthread_key_delete(), and pthread_setspecific(), is illustrated by 
pthread_getspecific(). This function returns a C void pointer to a 
data object associated with the calling thread. This allows local data 
pertinent to a user’s threaded function to be available before the thread 
terminates. Fortran defines a pointer attribute for intrinsic and derived 
types (from [1]): 

“a pointer is a descriptor with space to contain information about 
the type, type parameters, rank, extents, and location of a target. Thus 
a pointer to a scalar object of type real would be quite different from a 
pointer to an array of user-defined type. In fact each of these pointers 
is considered to occupy a different unspecified storage unit.” 

A C pointer is simply a memory address. As evidenced from the above 
passage, Fortran cannot access or manipulate memory addresses directly. 
In other words, the C and Fortran languages share the word ’pointer’ but 
not the logical content. At this time, we can find no portable way to im- 
plement the thread-specific data functions without imposing obstructive 
constraints. 
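
To make concrete what the excluded thread-specific data functions provide, the following Python sketch uses threading.local as a rough analogue. It is not an FPTHRD facility and does not resolve the pointer mismatch described above; the function names worker, report, and run are hypothetical.

```python
import threading

# One shared "key": each thread that touches it sees only its own value,
# roughly what pthread_key_create() and pthread_setspecific() provide in C.
thread_data = threading.local()


def report():
    # Analogue of pthread_getspecific(): no argument identifies the thread.
    return thread_data.value


def worker(name, results):
    thread_data.value = name       # analogue of pthread_setspecific()
    results[name] = report()       # the callee finds the caller's own data


def run(names):
    results = {}
    threads = [threading.Thread(target=worker, args=(n, results))
               for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread reads back the value it stored itself, and the value is never visible to the other threads or to the main thread.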

Functions pthread_attr_getstackaddr() and pthread_attr_setstackaddr() 
manipulate a thread’s stack address. As stated previously, Fortran has 
no facility to directly manipulate memory addresses. Besides, 
implementation details, such as a stack, are not addressed in the 
Fortran standard. Thus, there is a danger that a Fortran program that 
calls these routines may not recognize the setting of the stack address. 

The pthread_atfork(), pthread_kill(), and pthread_sigmask() 
functions deal with the fork() function and inter-thread signalling. 
Since support for these features from Fortran programs within the 
run-time system is unknown and perhaps even absent, especially across 
different operating systems, these functions are not included in 
FPTHRD. 

3.5. PACKAGE CONTENTS 

The FPTHRD API sample implementation consists of a Fortran 
module, fpthrd.f90, and a file of C functions, ptf90.c, together with 
an include or header file, summary.h. We also have included four 
test/verification programs, timing programs for matrix-vector product 
and matrix transpose, and other documentation. 

4. EXAMPLE OF A BARRIER ROUTINE 
FOR THREADS 

A routine often needed in scientific applications is thread synchroniza- 
tion or the use of a barrier routine. This need arises since concurrency is 
often ubiquitous in a computation, but the results cannot be used until 
all the participating threads have completed their work. 

The Pthreads library, and the provided FPTHRD API, operates at 
a lower level of functionality than a barrier function. In fact a typical 
barrier is shown below, constructed from FPTHRD routines. Note that 
the example uses a mutex or lock variable (LOCK), a condition variable 
(COND), a simple counter (COUNTER), and a logical switch (CYCLE). The 
value NTHREADS is the number of threads to synchronize, which is set 
before the routine is called. The routine also relies on an external dummy 
routine (VOLATILE()) that is intended to force a load from memory for 
the argument. 

Here are remarks about the code given in Table 4: 

■ The FPTHRD variables LOCK, COND, and the integer variables 
CYCLE, SWITCH, and NTHREADS, are declared in a module that con- 
tains the routine SYNC_ALL(). The variables COND and LOCK 
must be initialized (not shown here); the value of COUNTER is ini- 
tialized to NTHREADS; and the value of CYCLE is initialized to zero 
(also not shown). 

■ Each thread stores a local copy of CYCLE as LOCAL_CYCLE. (Line 5) 

■ If there is one thread, there is no need to synchronize further, and 
the routine exits immediately. (Lines 4 & 11) 

■ Each thread, as it acquires the lock, decrements the value of 
COUNTER. One distinguished thread will decrease COUNTER to the 
value of zero. (Lines 6-7) 

■ The distinguished thread signals the alternate threads that 
COUNTER now has the value zero. It prepares the routine for fur- 
ther calls (Lines 9 & 13) and broadcasts a synchronous signal to 
the alternate threads that “individually wakes them up” with the 
mutex locked. (Line 16) 

■ The distinguished thread changes the value of CYCLE (Line 13). 
This step is critically important, as explained next, in addition to 
preparing the routine for later calls. 

■ When the signal is broadcast, an alternate thread may not have 
yet entered the routine FPTHRD_COND_WAIT(). Testing the value of 
CYCLE protects against this and avoids missing the signal. Another 
possibility is that alternate threads may spuriously wake up, 
returning from routine FPTHRD_COND_WAIT(), and find that this signal 
is not intended for them. If the value of CYCLE has not changed, they 
again call routine FPTHRD_COND_WAIT(), which unlocks the mutex, and 
continue waiting (Lines 23-25). The use of an external dummy integer 
subprogram, VOLATILE(), is intended to guard against aggressive 
compiler optimization. For example, without the use of this function, 
the machine code could keep the values of CYCLE and LOCAL_CYCLE 
within integer registers. When the wakeup signal is broadcast, the 
test (Line 23) would always be satisfied with the register values and 
thus appear to be a spurious wakeup signal. This would cause the 
program to fail since another signal would not appear. 

Table 4 Typical barrier routine 

 1. SUBROUTINE SYNC_ALL()
 2. INTEGER LOCAL_CYCLE
 3. ! START of basic barrier code:
 4. IF (NTHREADS > 1) call fpthrd_mutex_lock(LOCK)
 5. LOCAL_CYCLE = CYCLE
 6. COUNTER = COUNTER - 1
 7. IF (COUNTER == 0) THEN
 8. ! Reset counter to number of threads.
 9.    COUNTER = NTHREADS
10. ! When there is only one thread, synchronizing is not required.
11.    IF (NTHREADS == 1) RETURN
12. ! Throw switch in alternate direction.
13.    CYCLE = 1 - CYCLE
14. ! These steps prepare the routine for another use.
15. ! The distinguished thread wakes up the waiting threads.
16.    call fpthrd_cond_broadcast(COND)
17. END IF
18. ! Waiting threads wake up, or see if the switch has changed.
19. ! The use of this function to test the value of CYCLE prevents
20. ! compiler optimization from using the wrong value of CYCLE.
21. ! An alternate thread will change the value, and each thread must
22. ! fetch its current value every time the test is made.
23. DO WHILE (VOLATILE(CYCLE) == LOCAL_CYCLE)
24.    call fpthrd_cond_wait(COND, LOCK)
25. END DO
26. call fpthrd_mutex_unlock(LOCK) ! END of basic barrier code.
27. END SUBROUTINE
28. FUNCTION VOLATILE(INT)
29. ! This external dummy function is intended to prevent code
30. ! optimization from removing a critical test. It does not do
31. ! anything except prevent compiler optimization errors.
32. INTEGER VOLATILE
33. INTEGER, INTENT(INOUT) :: INT
34. VOLATILE = INT
35. INT = VOLATILE
36. END FUNCTION
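
The logic of Table 4 can be restated compactly in another threading library. The sketch below uses Python's threading.Condition in place of the FPTHRD mutex and condition variable; it is an illustration of the same counter and cycle technique under that assumption, not a translation supplied with the package, and the class name CycleBarrier is hypothetical. No VOLATILE() helper is needed here because the shared flag is re-read on every pass through the loop.

```python
import threading


class CycleBarrier:
    """Reusable barrier mirroring Table 4: lock, condition, counter, cycle."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.counter = nthreads
        self.cycle = 0
        self.cond = threading.Condition(threading.Lock())

    def sync_all(self):
        with self.cond:                        # lock the mutex (Line 4)
            local_cycle = self.cycle           # Line 5
            self.counter -= 1                  # Line 6
            if self.counter == 0:              # distinguished thread (Line 7)
                self.counter = self.nthreads   # Line 9
                self.cycle = 1 - self.cycle    # Line 13
                self.cond.notify_all()         # Line 16
            # Re-test the shared flag after every wakeup (Lines 23-25), so a
            # spurious wakeup cannot release a thread prematurely and a late
            # arrival cannot miss a broadcast that has already happened.
            while self.cycle == local_cycle:
                self.cond.wait()
        # Leaving the with-block unlocks the mutex (Line 26).
```

Because the distinguished thread resets the counter and flips the cycle before any waiter can proceed, the same object can be used for repeated synchronization points.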

DISCUSSION 

Speaker: Richard Hanson 

Brian T. Smith : What were the comparisons with transpose and 
matrix multiply? Were the intrinsics threaded by the supplier or were 
they serial? 

Richard Hanson : These are comparisons of “ordinary code” with the 
array intrinsics TRANSPOSE and MATMUL. One expects the vendors to do 
a good job on these operations. In fact, at this problem size, threading 
is more effective on all the platforms we have ported to. 

John Rice : Where is the gain of efficiency made in the threads imple- 
mentation of the transpose? 

Richard Hanson : Quite probably each computational unit is using 
its cache and the system resources concurrently and efficiently. 

Abstract Scientific applications now have data management requirements that 
extend beyond the local computing environment. Scientific disciplines 
are assembling data collections that represent the primary source data 
for current research. SDSC has developed data management systems to 
facilitate use of published digital objects. The associated infrastructure 
includes persistent archives for managing technology evolution, data 
handling systems for collection-based access to data, collection manage- 
ment systems for organizing information catalogs, digital library ser- 
vices for manipulating data sets, and data grids for federating multiple 
collections. The infrastructure components can be characterized as in- 
teroperability systems for digital object management, information man- 
agement, and knowledge management. Examples of the application of 
the technology include distributed collections and data grids for astro- 
nomical sky surveys, high energy physics data collections, and art image 
digital libraries. 

Keywords: Data handling systems, digital libraries, grids, persistent archives 

1. INTRODUCTION 

Scientific disciplines are starting to assemble primary source data for 
use by researchers [7]. The data are typically distributed across mul- 
tiple administration domains and are stored on heterogeneous storage 
systems. The challenge is to facilitate the organization of these primary 
data resources into collections without compromising local control. At 
the same time, middleware is needed to support uniform access to the 
data sets, including APIs for direct application discovery and manipu- 
lation of the data, command line interfaces for accessing data sets from 
scripts, and web GUIs for interactive browsing and presentation of data 
sets. 

The development of infrastructure to support the publication of 
scientific data must recognize that information repositories and knowledge 
bases are also needed. One can differentiate between infrastructure 
components that provide: 

■ Data storage of digital objects that are either simulation output 
or remote sensing data. The digital objects are representations of 
reality, generated either through a hardware remote sensing device 
or by execution of an application. 

■ Information repositories that store attributes about the digital ob- 
jects. The attributes are typically stored as metadata in a catalog 
or database. 

■ Knowledge bases that characterize relationships between sets of 
metadata. An example is rule-based ontology mapping [6] that 
provides the ability to correlate information stored in multiple 
metadata catalogs. 

A scientific data publication system will need to support ingestion 
of digital objects, querying of metadata catalogs to identify objects of 
interest, and integration of responses across multiple information 
repositories. Fortunately, a rapid convergence of information management 
technology and data handling systems is occurring for the support of 
scientific data collections. The goal is to provide mechanisms for the 
publication of scientific data for use by an entire research community. 
The approach used at the San Diego Supercomputer Center is to orga- 
nize distributed data sets through creation of a logical collection [9]. The 
ownership of the data sets is assigned to the collection, and a data han- 
dling system is used to create, move, copy, replicate, and read collection 
data sets. Since all accesses to the collection data sets are done through 
the data handling system, it then becomes possible to put the data sets 
under strict management control, and implement features such as ac- 
cess control lists, usage audit trails, replica management, and persistent 
identifiers. 

Effectively, a distributed collection can be created in which the local 
resources remain under the control of the local site, but the data sets 
are managed by the global logical collection. Researchers authenticate 
themselves to the collection, and the collection in turn authenticates 
itself to the distributed storage systems on which the data sets reside. 
The collection manages the access control lists for each data set 
independently of the local site. The local resources are effectively encapsulated 
into a collection service, removing the need for researchers to have user 
accounts at each site where the data sets are stored. 

The data handling system serves as an interoperability mechanism for 
managing storage systems. Instead of directly storing digital objects in 
an archive or file system, the interposition of a data handling system 
allows the creation of a collection that spans multiple storage systems. 
It is then possible to automate the creation of a replica in an archival 
storage system, cache a copy of a digital object onto a local disk, and 
support the remote manipulation of the digital object. The creation of 
data handling systems for collection-based access to published scientific 
data sets makes it possible to automate all data management tasks. In 
turn, this makes it possible to support data mining against collections of 
data sets, including comparisons between simulation and measurement, 
and statistical analysis of the properties of multiple data sets. Data set 
handling systems can be characterized as interoperability mechanisms 
that integrate local data resources into global resources. The interoper- 
ability mechanisms include 

■ inter-domain authentication, 

■ transparent protocol conversion for access to all storage systems, 

■ global persistent identifiers that are location and protocol indepen- 
dent, 

■ replica management for cached and archived copies, 

■ container technology to optimize archival storage performance and 
co-locate small data sets, and 

■ tools for uniform collection management across file systems, 
databases, and archives. 

2. DATA HANDLING INFRASTRUCTURE 

The data management infrastructure is based upon technology from 
multiple communities that are developing archival storage systems, par- 
allel and XML [3] database management systems, digital library services, 
distributed computing environments, and persistent archives. The com- 
bination of these systems is resulting in the ability to describe, manage, 
access, and build very large distributed scientific data collections. Sev- 
eral key factors are driving the technology convergence: 

■ Development of an appropriate information exchange protocol and 
information tagging model. The ability to tag the information 
content makes it possible to directly manipulate information. The 
Extensible Markup Language (XML) provides a common information 
model for tagging data set context and provenance. Document Type 
Definitions (and related organizational methods such as XML Schema 
[10]) provide a way to organize the tagged attributes. Currently, each 
discipline is developing its own markup language (set of attributes) 
for describing its domain-specific information. The library community 
has developed some generic attribute sets such as the Dublin Core to 
describe provenance information. The combination of the Dublin Core 
metadata and discipline-specific metadata can be used to describe 
scientific data sets. 

■ Differentiation between the physical organization of a collection 
(conceptually the table structures used to store attributes in 
object-relational databases) and the logical organization of a col- 
lection (the schema). If both contexts are published, it becomes 
possible to automate the generation of the SQL commands used to 
query relational databases. For XML-based collections, the emer- 
gence of XML Matching and Structuring languages [1] makes it 
possible to construct queries based upon specification of attributes 
within XML DTDs. Thus attribute-based identification of data 
sets no longer requires the ability to generate SQL or XQL com- 
mands from within an application. 

■ Differentiation of the organization and access mechanisms for a 
logical collection from the organization and access mechanisms re- 
quired by a particular storage system. Conceptually, data handling 
systems store data in storage systems rather than storage devices. 
By keeping the collection context independent of the physical stor- 
age devices, and providing interoperability mechanisms for data 
movement between storage systems, logical data set collections can 
be created across any type of storage system. Existing data collec- 
tions can be transparently incorporated into the logical collection. 
The only requirement is that the logical collection be given ac- 
cess control permissions for the local data sets. The data handling 
system becomes the unifying middleware for access to distributed 
data sets. 

■ Differentiation of the management of information repositories from 
the storage of metadata into a catalog. Information management 
systems provide the ability to manage databases. It is then possi- 
ble to migrate metadata catalogs between database instantiations, 
extend the schema used to organize the catalogs, and export meta- 
data as XML or HTML formatted files. 
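
The second factor above, generating SQL automatically once both the logical and the physical organization of a collection are published, can be sketched as follows. This is a minimal illustration and not the SDSC implementation: the attribute names, table names, join rule, and the function build_query are all hypothetical, and the `?` placeholders follow the parameter style of Python's sqlite3 module.

```python
# Hypothetical published contexts for one collection.
# Keys are logical attribute names that users and applications see;
# values give the physical table and column that store each attribute.
PHYSICAL = {
    "object_id":  ("objects", "id"),
    "title":      ("objects", "title"),
    "wavelength": ("observations", "band_microns"),
}
# Join predicate for each pair of tables that may appear together.
JOINS = {("objects", "observations"): "objects.id = observations.object_id"}


def build_query(select, where):
    """Return (sql, params) for a query phrased in logical attribute names.

    select is a list of attribute names; where is a list of
    (attribute, operator, value) triples.
    """
    columns = ["%s.%s" % PHYSICAL[a] for a in select]
    tables = []
    for attr in list(select) + [a for a, _, _ in where]:
        table = PHYSICAL[attr][0]
        if table not in tables:
            tables.append(table)
    conditions, params = [], []
    for attr, op, value in where:
        if op not in ("=", "<", ">", "<=", ">="):
            raise ValueError("unsupported operator: %r" % op)
        conditions.append("%s.%s %s ?" % (PHYSICAL[attr] + (op,)))
        params.append(value)
    # Add the join predicate for every pair of tables the query touches.
    for (left, right), predicate in JOINS.items():
        if left in tables and right in tables:
            conditions.append(predicate)
    sql = "SELECT %s FROM %s" % (", ".join(columns), ", ".join(tables))
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params
```

An application then asks for, say, the title of every object observed at a given wavelength without knowing which tables hold those attributes; only the published mapping changes if the physical layout changes.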

The ability to manipulate data sets through collection-based access 
mechanisms enables the federation of data collections and the creation 
of persistent archives. Federation is enabled by publishing the schema 
used to organize a collection as an XML DTD. Information discovery can 
then be done through queries based upon the semi-structured representa- 
tion of the collection attributes provided by the XML DTD. Distributed 
queries across multiple collections can be accomplished by mapping be- 
tween the multiple DTDs, either through use of rules-based ontology 
mapping, or token-based attribute mapping. 
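
Token-based attribute mapping, the simpler of the two federation approaches named above, can be sketched as a lookup table from shared tokens to each collection's local attribute names. Everything in this Python fragment is hypothetical (the two collections, their attribute names, the tokens, and the function federated_query); it is meant only to show the shape of a distributed query, not an actual SDSC interface.

```python
# Two hypothetical collections that describe the same kind of object with
# different attribute names, as if each had published its own DTD.
SKY_SURVEY = [
    {"ra_deg": 10.68, "dec_deg": 41.27, "mag": 3.4},
    {"ra_deg": 23.46, "dec_deg": 30.66, "mag": 5.7},
]
IMAGE_LIBRARY = [
    {"RightAscension": 83.82, "Declination": -5.39, "Brightness": 4.0},
]
COLLECTIONS = {"sky_survey": SKY_SURVEY, "image_library": IMAGE_LIBRARY}

# Token-based mapping: one shared token per concept, mapped to the local
# attribute name used by each collection.
MAPPING = {
    "sky_survey":    {"ra": "ra_deg", "dec": "dec_deg", "brightness": "mag"},
    "image_library": {"ra": "RightAscension", "dec": "Declination",
                      "brightness": "Brightness"},
}


def federated_query(token, predicate):
    """Apply predicate to the attribute named by token in every collection.

    Returns a list of (collection_name, record_in_shared_tokens) pairs.
    """
    hits = []
    for name, records in COLLECTIONS.items():
        local = MAPPING[name]
        for record in records:
            if predicate(record[local[token]]):
                # Translate the record back into the shared vocabulary.
                shared = {tok: record[attr] for tok, attr in local.items()}
                hits.append((name, shared))
    return hits
```

A rule-based ontology mapping would replace the fixed dictionary with rules that can also convert units or combine attributes; the query loop would stay the same.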

Persistent archives can be enabled by archiving the context that de- 
fines both the physical and logical collection organization along with 
the data sets that comprise the collection [8]. The collection context 
can then be used to recreate the collection on new database technology 
through an instantiation program. This makes it possible to migrate a 
collection forward in time onto new technology. The collection descrip- 
tion is instantiated on the new technology, while the data sets remain 
on the physical storage resource. The collection instantiation program 
is updated as database technology evolves, while the archived data re- 
mains under the control of the data handling system. As the archive 
technology evolves, new drivers are added to the data handling system 
to interoperate with the new data access protocols. 

The implementation of information management technology needs to 
build upon the information models and manipulation capabilities that 
are coming from the Digital Library community, and the remote data 
access and procedure execution support that is coming from the dis- 
tributed computing community. The Data Access Working Group of 
the Grid Forum [5] is promoting the development of standard imple- 
mentation practices for the construction of grids. Grids are inherently 
distributed systems that tie together data, compute, and visualization 
resources. Researchers rely on the grid to support all aspects of in- 
formation management and data manipulation. An end-to-end system 
provides support for: 

■ Knowledge discovery - ability to identify relationships between 
digital objects stored in different discipline collections 

■ Information discovery - ability to query across multiple information 
repositories to identify data sets of interest 

■ Data handling - ability to read data from a remote site for use 
within an application 

■ Remote processing - ability to filter or subset a data set before 
transmission over the network 

■ Publication - ability to add data sets to collections for use by other 
researchers 

■ Analysis - ability to use data in scientific simulations, or for data 
mining, or for creation of new data collections 

These services are implemented as middleware that hides the 
complexity of the diverse distributed heterogeneous resources that comprise 
data and compute grids [4]. The services provide four key functionalities 
or transparencies that simplify the complexity of accessing distributed 
heterogeneous systems. 

■ Name transparency - Unique names for data sets are needed to 
guarantee a specific data set can be found and retrieved. However, 
it is not possible to know the unique name of every data set that 
can be accessed within a data grid (potentially billions of objects). 
Attribute based access is used so that any data set can be iden- 
tified either by data handling system attributes, or Dublin core 
provenance attributes, or discipline specific attributes. 

■ Location transparency - Given the identification of a desired data 
set, a data handling system manages interactions with the possibly 
remote data set. The actual location of the data set can be main- 
tained as part of the data handling system attributes. This makes 
it possible to automate remote data access. When data sets are 
replicated across multiple sites, attribute-based access is essential 
to allow the data handling system to retrieve the “closest” copy. 

■ Protocol transparency - Data grids provide access to hetero- 
geneous data resources, including file systems, databases, and 
archives. The data handling system can use attributes stored in the 
collection catalog to determine the particular access protocol re- 
quired to retrieve the desired data set. For heterogeneous systems, 
servers can be installed on each storage resource to automate the 
protocol conversion. Then an application can access objects stored 
in a database or in an archive through a uniform user interface. 

■ Time transparency - At least five mechanisms can be used to min- 
imize retrieval time for distributed objects: data caching, data 
replication, data aggregation, parallel I/O, and remote data filter- 
ing. Each of these mechanisms can be automated as part of the 
data handling system. Data caching can be automated by having 
the data handling system pull data from the remote archive to a local 
data cache. Data replication across multiple storage resources 
can be used to minimize wide area network traffic. Data aggregation 
through the use of containers can be used to minimize access 
latency to archives or remote storage systems. Parallel I/O can 
be used to minimize the time needed to transfer a large data set. 
Remote data filtering can be used to minimize the amount of data 
that must be moved. This latter capability requires the ability to 
support remote procedure execution at the storage resource. 
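
Name and location transparency can be illustrated together with a toy catalog. The Python sketch below is hypothetical throughout (the catalog layout, identifiers, site names, distance table, and both function names); it assumes that "closest" can be reduced to a single cost per site.

```python
# Hypothetical catalog: each entry has a persistent identifier, descriptive
# attributes, and one or more replicas with a site and an access protocol.
CATALOG = [
    {"id": "coll/0001",
     "attributes": {"discipline": "astronomy", "kind": "image"},
     "replicas": [{"site": "site_a", "protocol": "archive"},
                  {"site": "site_b", "protocol": "filesystem"}]},
    {"id": "coll/0002",
     "attributes": {"discipline": "astronomy", "kind": "catalog"},
     "replicas": [{"site": "site_b", "protocol": "database"}]},
]

# Assumed network cost from the client to each site (smaller is closer).
DISTANCE = {"site_a": 5, "site_b": 1}


def find(**wanted):
    """Name transparency: select data sets by attribute values, not by name."""
    return [entry for entry in CATALOG
            if all(entry["attributes"].get(k) == v
                   for k, v in wanted.items())]


def closest_replica(entry, available=None):
    """Location transparency: pick the nearest replica at a reachable site."""
    sites = set(DISTANCE) if available is None else set(available)
    candidates = [r for r in entry["replicas"] if r["site"] in sites]
    if not candidates:
        raise LookupError("no reachable replica for %s" % entry["id"])
    return min(candidates, key=lambda r: DISTANCE[r["site"]])
```

The protocol recorded with the chosen replica is what a protocol-transparent layer would then use to select the matching storage driver.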

3. APPLICATION 

A collection-based data management system has the following 
software infrastructure layers: 

■ Data Grid - for federation of access to multiple data collections 
and digital libraries 

■ Digital library - provide services for discovering, manipulating, 
and presenting data from collections 

■ Data collection - provide support for extensible, dynamically 
changing organizations of data sets 

■ Data handling system - provide persistent IDs for collection-based 
access to data sets 

■ Persistent archive - provide collection-based storage of data sets, 
with the ability to handle evolution of the software infrastructure. 

The essential infrastructure component is the data handling system. 
It is possible to use data handling systems to assemble distributed data 
collections, integrate digital libraries with archival storage systems, fed- 
erate multiple collections into a data grid, and create persistent archives. 
An example that encompasses all of these cases is the 2-Micron All Sky 
Survey image archive. The 2MASS survey is an astronomy project led 
by the University of Massachusetts and Caltech to assemble a catalog of 
all stellar objects that are visible at the 2-micron wavelength. The goal 
of the project is to provide a catalog that lists attributes of each object, 
such as brightness and location. The final catalog can contain as many 
as 2 billion stars and 200 million galaxies. Of interest to astronomers 
is the ability to analyze the images of all the galaxies. This is a mas- 
sive data analysis problem since there will be a total of 5 million images 
comprising 10 terabytes of data. 

A collaboration between IPAC at Caltech and the NPACI program 
at SDSC is building an image catalog of all of the 2MASS observations. 
A digital library is being created at Caltech that records which image 
contains each galaxy. The images are sorted into 147,000 containers to 
co-locate all images from the same area in the sky. The image collection 
is then replicated between two archives to provide disaster recovery. The 
SDSC Storage Resource Broker [2] is used as the data handling system 
to provide access to archives at Caltech and SDSC where the images are 
replicated. 

The usage model supports the following access scenario: 

■ Astronomers access the catalog at Caltech to identify galaxy types 
of interest. 

■ Digital library procedures determine which images need to be re- 
trieved. 

■ The data handling system maps the image to the appropriate con- 
tainer, retrieves the container from the archive, and caches the 
container on a disk. 

■ The desired image is then read from the disk and returned to the 
user through the digital library. 

Since the images are accessed through the data handling system, the 
desired images can be retrieved from either archive depending upon load 
or availability. If the container has already been migrated to disk cache, 
the data handling system can immediately retrieve the image from disk 
avoiding the access latency inherent in reading from the archive tape 
system. If one archive is inaccessible, the data handling system auto- 
matically defaults to the alternate storage system. If data is migrated 
to alternate storage systems, the persistent identifier remains the same 
and the data handling system adds the location and protocol access 
metadata for the new storage system. The system incorporates persis- 
tent archive technology, data handling systems, collection management 
tools, and digital library services in order to support analysis of galactic 
images. 
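
The access scenario and failover behaviour described above can be summarized in a short sketch. This Python class is a toy model, not the Storage Resource Broker API: the class DataHandler, its attributes, and the representation of an archive as a dictionary (with None marking an inaccessible archive) are assumptions made for illustration.

```python
class DataHandler:
    """Toy data handling system: images live in containers held by archives."""

    def __init__(self, image_to_container, archives):
        # image id -> container id
        self.image_to_container = image_to_container
        # Each archive maps container id -> {image id: bytes};
        # None marks an archive that is currently inaccessible.
        self.archives = archives
        # container id -> {image id: bytes}, staged on local disk
        self.disk_cache = {}
        self.archive_reads = 0

    def _fetch_container(self, container_id):
        # Try each replica in turn and skip archives that are offline.
        for archive in self.archives:
            if archive is not None and container_id in archive:
                self.archive_reads += 1
                return archive[container_id]
        raise LookupError("container %s not reachable" % container_id)

    def read_image(self, image_id):
        container_id = self.image_to_container[image_id]
        if container_id not in self.disk_cache:
            # Cache miss: stage the whole container from an archive to disk.
            self.disk_cache[container_id] = self._fetch_container(container_id)
        # Cache hit: no archive latency is incurred.
        return self.disk_cache[container_id][image_id]
```

Reading a second image from the same container costs no further archive access, which is the benefit of co-locating images from the same area of the sky.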

Given the ability to access a 10-terabyte collection that contains a 
complete sky survey at the 2-micron wavelength, it is then possible to 
do data intensive analyses on terascale computer platforms. The data 
rates that are supported by a teraflops-capable computer will be over 100 
Megabytes per second. It will be possible to read the entire collection 
in 30 hours. During this time period, over ten billion operations can be 
done on each of the five million images. Effectively, the entire survey 
can be analyzed in a “typical” computation on a teraflops computer. 
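
The figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 terabyte = 10^12 bytes) and a sustained rate of exactly 100 megabytes per second and one teraflops, which the text gives only as approximate values.

```python
collection_bytes = 10e12   # 10 terabytes
read_rate = 100e6          # 100 megabytes per second
images = 5e6               # five million images
flops = 1e12               # one teraflops, assumed sustained

read_seconds = collection_bytes / read_rate      # 1.0e5 seconds
read_hours = read_seconds / 3600.0               # about 27.8 hours
ops_per_image = flops * read_seconds / images    # about 2.0e10 operations
bytes_per_image = collection_bytes / images      # about 2.0e6 bytes
```

Under these assumptions the read takes just under 28 hours, consistent with the quoted 30 hours, and about twenty billion operations are available per image, consistent with "over ten billion".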




Data Management Systems for Scientific Applications 281 



4. SUMMARY 

Data intensive computing is facilitated by the organization of scien- 
tific data into collections that can then be processed by the scientific 
community. In the long run, the utility of scientific computation will 
be measured by the publication of the results of the computation into 
collections for use by the rest of the community. Digital objects that 
remain as local files on a researcher’s workstation will be of little use 
to the scientific discipline. The utility of digital objects will be directly 
related to the specification of their context through membership in a 
scientific data collection. At this point, the fundamental access paradigm 
will shift to reading and writing data from collections, with applications 
using APIs to discover, access, and manipulate collection-based data. 
Each discipline will use their data repositories, information catalogs, 
and knowledge bases to provide direct access to all of the primary data 
sources for their domain. 

DISCUSSION 

Speaker: Reagan Moore 

Richard Fateman : Who pays for storage and access of these collec- 
tions? How is annotation linked to data? 

Reagan Moore : The collections are the result of community-wide ef- 
forts within scientific domains. An example is the National Virtual Ob- 
servatory, which plans the integration of the major all sky survey image 
collections. The funding will come from their sponsoring agencies, in- 
cluding NSF and NASA. The users of the collections will be astronomers. 
Access to collections is supported through an application programmer’s 
interface, a set of Unix command-line utilities, and a Java/NT browser. 
The NT browser can be executed from a CDROM, making it possible 
to load a client in a few minutes. The SDSC Storage Resource Bro- 
ker (SRB) client runs on Linux, Unix workstations, Windows NT, and 
supercomputers. 

David W. Walker : To what extent are the storage and data man- 
agement systems you have described in production use by the end-user 
scientific community? 

Reagan Moore : The major uses of the SDSC SRB are scientific com- 
munities that are assembling discipline-wide scientific collections. The 
largest production system is for the 2MASS two-micron all sky survey. 
Over 3 million images (6 terabytes of data) have been ingested into a 
collection, with the complete image set to be imported by January 2001. 
The SRB is installed at 35 sites, and has over 150 registered users. 

Fred Gustavson : My question concerns efficiency. What scientific 
methods are being used by your systems for identifying and classifying 
data so that it is well organized to perform a large scientific computation 
on that data? An example would be the computational fluid dynamics 
computation you spoke about on historical data using new computa- 
tional algorithms for computational fluid dynamics. 

Reagan Moore : The organization of digital objects into a collection 
depends upon consensus within a scientific domain for the appropriate 
metadata. In the case of the Data Assimilation Office, the input data 
sets are observed data that have been organized for a daily analysis. 
The knowledge needed to use the data is embedded in the application. 
A simple unique label for each data object is the date of the assimilation 
run. An analysis requires retrieval and use of each data set for the time 
period of interest. More general characterizations are needed when the 
collection must support multiple applications. In particular, as new ap- 
plications are developed, additional metadata attributes may be needed. 
For any collection it may be necessary to extend the schema to include 
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the new attributes. Current technology supports the export and import 
of metadata from databases using XML DTDs, making this easy to do. 
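
As an illustration of the export and import step mentioned in this answer, the following is a minimal sketch that uses Python's standard xml.etree.ElementTree module. The element names ("collection", "object"), the attribute names and the sample values are all hypothetical, and the sketch does not validate against a DTD. It only shows that a record written before the schema was extended can still be exported and re-imported after a new attribute is added.

```python
import xml.etree.ElementTree as ET

def export_metadata(records, attributes):
    """Serialise a list of metadata records (dicts) to an XML string."""
    root = ET.Element("collection")
    for record in records:
        obj = ET.SubElement(root, "object")
        for name in attributes:
            # Attributes missing from older records are exported as empty.
            ET.SubElement(obj, name).text = str(record.get(name, ""))
    return ET.tostring(root, encoding="unicode")

def import_metadata(xml_text):
    """Rebuild the list of records from the XML string."""
    root = ET.fromstring(xml_text)
    return [{child.tag: (child.text or "") for child in obj} for obj in root]

# Hypothetical records labelled by the date of the assimilation run.
records = [{"run_date": "2000-06-01", "instrument": "sonde"}]
schema = ["run_date", "instrument"]

# Extending the schema: a new application needs a "resolution" attribute.
schema.append("resolution")
records.append({"run_date": "2000-06-02", "instrument": "sonde",
                "resolution": "1deg"})

restored = import_metadata(export_metadata(records, schema))
```
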

Margaret Wright : What is the role of data compression? Where does 
it fit within the overall structure that you described? 

Reagan Moore : Data compression can give compression factors of 
1.5 for lossless algorithms, and factors of 10-100 for lossy algorithms. 
In practice, thumbnail (browse) images are created to help with dis- 
covery, and then the high resolution, lossless data objects are accessed. 
Compression can be supported at the remote site through proxy proce- 
dures. 
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1. APPLICATION DEVELOPMENT 
REQUIREMENTS 

In this paper we present some of the components that are provided by 
NAG and an infrastructure that can harness those components to pro- 
vide a rich application development environment. Application develop- 
ers have a general set of needs which include computational components, 
visualisation, access to data storage resources and often the ability to 
interact with the simulation or application during the course of a run. 
In the following sections we will discuss computational components de- 
veloped by NAG and a system to provide integrated visualisation in 
an application, together with examples of such application development 
building toolkits. 

The examples are taken from three European funded projects that 
have explored these issues in different application areas. 



2. COMPUTATIONAL COMPONENTS 

NAG has been involved in high-performance libraries for many years. 
This work began in the early seventies with the vectorization of routines 
and the definition of the BLAS [1], [2], [3], and continued through the 
eighties and nineties in projects such as LAPACK [4] and ScaLAPACK [5] 
and the development of NAG's own parallel and SMP libraries. 

In developing libraries we always need to consider the needs of the 
application developers. These vary according to the computational 
problems that they are solving, the software environment in which they 
work, and the computing architectures that are available to them. When 
NAG first became involved in the development of high-performance 
numerical libraries the language was FORTRAN and the computing 
architecture was a mainframe, a workstation or, for a few lucky 
individuals, a Cray supercomputer. Today, the scene has changed dramatically 
and continues to evolve at a rapid rate. In order to provide the kind of 
computational support that application developers have come to expect 
we need to create plug-and-play libraries that can adapt to the appro- 
priate hardware/software environment. This implies algorithms which 
can take advantage of the architecture but remain portable, in languages 
that can interoperate with a number of possible application languages 
and programming paradigms. 

Over time NAG has developed many of the components to allow this 
flexibility of programming environment. These include libraries in 
FORTRAN [6], [7] and C [8] together with high-performance libraries for 
both shared memory [9] and distributed memory machines [10]. The 
algorithms included in these libraries cover most areas of numerical 
analysis and statistics. 

The libraries have become the backbone of many industrial and 
research applications. They have been developed with traditional 
programming styles in mind. In the examples below, however, we show 
how these libraries of routines can be thought of as collections of com- 
putational components that can be integrated into modern interactive 
environments. 



2.1. PROVIDING AN INFRASTRUCTURE 

So far we have talked only about the computational components pro- 
vided by the NAG libraries. It is just as important to be able to harness 
these components across distributed resources. An application 
development environment must not only be easy to use and able to run 
on distributed platforms, but must also provide visualisation and, in 
some cases, the ability to interact with the application. 
NAG develops the IRIS Explorer system [11] which, although historically 
a data flow visualisation system, is more and more proving invaluable 
as the infrastructure for computational problems. The IRIS Explorer 
environment provides the user with a number of attractive features: 
an easy-to-use visual programming environment, in which modules of the 
application are connected by simple pipes; the capability to interact 
with the application; and the module builder, which acts as a gateway 
to Fortran/C/C++ applications. The latest version of IRIS Explorer also 
comes with a built-in collaborative environment in which application 
developers can cooperate on the same application and, of course, 
two- and three-dimensional visualisation is an integrated part of the 
environment. An application built in IRIS Explorer is simply a set of 
connected modules, called a map (see Figures 1 and 2). An application 
developer can easily create applications running on distributed machines 
by simply attaching modules that reside on those machines. The framework 
is such that the developer can create maps to represent the application 
and then define the user interface for that application. The end-user 
may see only the application interface which has been defined for their 
needs, incorporating output in a variety of forms including graphics and 
visualisation, and sliders or dials to alter parameters during the 
computation. They do not need to know anything about the IRIS Explorer 
environment and modules in order to run the application. 
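
The idea of a map as a set of modules connected by pipes can be sketched in a few lines. This is only an illustration of the data flow model in Python, not IRIS Explorer code: the Module class, its fire method and the three example modules are hypothetical names introduced here.

```python
class Module:
    """A processing step with one function and any number of upstream modules."""

    def __init__(self, name, func, *inputs):
        self.name = name
        self.func = func
        self.inputs = inputs  # upstream modules; these play the role of pipes

    def fire(self, cache=None):
        """Evaluate upstream modules first, then this one (each at most once)."""
        cache = {} if cache is None else cache
        if self.name not in cache:
            args = [m.fire(cache) for m in self.inputs]
            cache[self.name] = self.func(*args)
        return cache[self.name]

# A three-module map: read -> scale -> summary.
read = Module("read", lambda: [1.0, 2.0, 3.0])
scale = Module("scale", lambda xs: [2.0 * x for x in xs], read)
summary = Module("summary", lambda xs: sum(xs) / len(xs), scale)

result = summary.fire()
print(result)
```

Firing the last module pulls data through the pipes from the modules upstream of it, which is one simple way to realise the behaviour of a connected map.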

IRIS Explorer provides two user interface mechanisms: the graphical 
user interface (GUI) and a command interface. The command interface 
provides a method to run IRIS Explorer using text-based commands. 
Commands may be issued directly from the keyboard, or can be supplied 
in a script that IRIS Explorer runs, using the IRIS Explorer scripting 
language called Skm (pronounced scheme). The scripting language is 
useful for three reasons: 

■ Where a standard sequence of operations on a map is required, 
such as in testing, this can be provided in Skm. 

■ Skm is also useful when a user wants to run an IRIS Explorer 
application in batch. 

■ IRIS Explorer can be run remotely from another system, such as a 
personal computer. 

There are many advantages to using IRIS Explorer as an application 
framework. First, the visual programming environment enables 
non-programmers to develop their applications in a user-friendly way. 
Secondly, the capabilities for defining new interfaces mean that specific 
applications can be tailored to meet the needs of non-specialist end-users. 
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Thirdly, powerful visualisation facilities become integrated with the ap- 
plication and the user has the ability to interact with the application 
during the execution. 

Using IRIS Explorer in this application development mode has much 
in common with the more specialised system SciRUN [12] and other 
similar problem solving environments such as GasTURBLAB [13]. The 
SciRUN environment has many of the same features but is aimed 
essentially at large-scale computations and as such has the essential 
modules for those problems integrated. In the GasTURBLAB PSE, 
IRIS Explorer provides the interface to a complex problem solving envi- 
ronment. 

IRIS Explorer has provided this type of infrastructure in three 
systems developed within European funded projects in which NAG is a 
partner. One result of the collaboration in these projects is that the 
development of IRIS Explorer itself has been steered in directions 
pertinent to the application areas and to the needs of an application 
development environment. In the next sections we will review the issues raised in 
each of these projects and the resulting developments. 



3. IRIS EXPLORER AS DEVELOPMENT 
ENVIRONMENT 

In this section we review the three projects and the ways in which 
the computational components and IRIS Explorer have been utilised. 



3.1. THE STABLE PROJECT 

The aim of the STABLE project was to design, build and demonstrate 
a modern Statistical Application Building Environment. 

The STABLE system was an integration of IRIS Explorer and an 
existing, widely used statistical software system, Genstat [14]. Part of 
the project was to build and test the system within three different ap- 
plication areas. A team worked within each of these application areas 
to provide the specifications and build the particular applications from 
the prototype system. The three application areas were: 

■ Design of Experiments and Statistical Process Control for the 
process of manufacturing aerosol cans. 

■ Analysis of field experiments for agricultural crops. 

■ Forecasts of the electricity demand on the Balearic Islands. 
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The end-user problems are an important aspect of the project as they 
will not only inform the design of computational modules, providing 
a test bed during the development phase, but will also demonstrate 
the effectiveness of the platform in tackling ’real world’ problems. The 
resulting STABLE environment allows experimentation and prototyping 
while at the same time providing a framework for the development of 
complex applications. 

In the following sections we will describe the Stable system and the 
re-engineering and integration that was required. 

The statistical components in STABLE are based mainly on the Gen- 
stat statistical system with additional functionality taken from the NAG 
libraries. Genstat covers all the standard statistical techniques such as 
analysis of variance, regression and generalised linear models, curve fit- 
ting, multivariate and cluster analysis, estimation of variance compo- 
nents, analysis of unbalanced mixed models, tabulation and time series. 
It also offers many advanced techniques, including generalised additive 
models and the analysis of repeated measurements. 

Genstat is a package created in the 1970s and like many packages of 
that era, it was written in FORTRAN with a central pool of data and a 
dependence on COMMON for communication of that data. 

Genstat is based on command-interpreters that execute commands 
issued by the user. Each interpreter is implemented as a hierarchy of 
functions that interface with the kernel to obtain or define data. In 
order to create STABLE modules from the Genstat algorithms it was 
necessary to restructure each interpreter so that the algorithmic code 
became a closed system, whose input was a data array, control param- 
eters, and workspace, and whose output was another data object. This 
implies that the architecture with a central pool of data and COMMON 
communications had to be re-engineered so that it operated on passed 
arguments rather than using pointers into the main data array. The 
result was a set of statistical algorithmic Dynamic Link Libraries (DLLs) 
which could then be incorporated into the system as modules. 
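
The kind of restructuring described above can be sketched as follows. The sketch is in Python rather than FORTRAN and is not Genstat code: the shared list POOL stands in for a COMMON data pool, and the two routines and their arguments are hypothetical. The first routine depends on offsets into shared state; the second is a closed routine whose input data, control parameters and workspace are all passed as arguments.

```python
# Before: the algorithm reads from and writes to a shared pool, as a
# FORTRAN routine would through COMMON. POOL and the offsets are hypothetical.
POOL = [0.0] * 16

def mean_in_pool(start, length, result_index):
    """Compute a mean by indexing into the shared pool."""
    POOL[result_index] = sum(POOL[start:start + length]) / length

# After: a closed routine. Input data, control parameters (here optional
# weights) and workspace are passed in, and the result is returned.
def mean_closed(data, weights=None, workspace=None):
    n = len(data)
    work = workspace if workspace is not None else [0.0] * n
    for i, value in enumerate(data):
        work[i] = value * (weights[i] if weights else 1.0)
    total_weight = sum(weights) if weights else float(n)
    return sum(work[:n]) / total_weight

# The pool-based version needs the caller to know the pool layout.
POOL[0:3] = [1.0, 2.0, 3.0]
mean_in_pool(0, 3, 10)

# The closed version can be called from any environment.
closed_result = mean_closed([1.0, 2.0, 3.0])
```

Because the closed routine touches nothing outside its arguments, it can be packaged as a library entry point and called from a module without knowledge of the original kernel.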

The integration of the two separate elements, Genstat DLLs and IRIS 
Explorer, required the definition of a data structure, which provided the 
interface to the statistical modules and carried all information required 
from the input of information to the graphical or other interpretation of 
the data. The data structure, a DataTablet, allowed a flexible 
infrastructure and an open system. Once an application developer is 
familiar with the DataTablet it is easy to integrate an existing program 
or routines into the system. As data is passed through the application, 
all the information required to ensure the correct action on the data is 
immediately available. 
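
The text does not give the layout of the DataTablet, so the following is only a sketch of the general idea of a self-describing data structure. The Column and DataTablet classes, their fields, and the type labels such as "time_series" are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    values: list
    kind: str = "numeric"  # e.g. "numeric", "factor", "time_series"

@dataclass
class DataTablet:
    """Hypothetical container carrying data together with its statistical type."""
    columns: dict = field(default_factory=dict)

    def add(self, column):
        self.columns[column.name] = column

    def require(self, name, kind):
        """Return a column, refusing data of the wrong statistical type."""
        column = self.columns[name]
        if column.kind != kind:
            raise TypeError(f"{name!r} is {column.kind}, expected {kind}")
        return column

tablet = DataTablet()
tablet.add(Column("demand", [3.1, 3.4, 3.0], kind="time_series"))
series = tablet.require("demand", "time_series").values
```

A module that receives such an object can check the type information before acting, which is the sense in which the information needed for the correct action travels with the data.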
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Figure 1 A Stable Example 



Figure 1 shows a simple STABLE example. The module ReadAscii 
inputs data from a file. In this case the user wants to perform a Time 
Series analysis of the data using the TSACF module. The module 
TypeColumn is used to set the type of data to a univariate time series. 
The analysis is then performed; the control panel of TSACF allows the 
user to vary the number of lags and coefficients to be used, and the 
results are directed to GraphView. Since GraphView knows from the 
datatype that this is Time Series data, it knows which types of 
visualisation are suitable and provides the figures in the background. 
The tools to rotate and zoom in and out of the plot are an integrated 
part of IRIS Explorer. 
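
For readers unfamiliar with the analysis in this example, the following sketch computes a sample autocorrelation function with the number of lags as a parameter. It uses the standard textbook estimator; whether the TSACF module uses exactly this estimator is not stated in the text, and the function name and test series are illustrative.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag (standard estimator)."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    c0 = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

# A short, made-up series; max_lag plays the role of the control-panel setting.
acf = autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
```
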



3.2. THE DECISION PROJECT 

The Decision project was an HPCN-funded project in Integrated Op- 
timisation Strategies for Increased Engineering Design Complexity. The 
goal of the project was to develop a tool for decision makers in new 
product design optimisation. Just as in the STABLE project, the 
development of the Decision platform, DEEP (Decision Explorer Engineering 
Platform), was driven by industrial applications provided by Nokka Tume, 
a Finnish manufacturer of forestry machinery, MESSET, a Finnish 
manufacturer of electromechanical films, and VTT, an independent Finnish 
research centre. MESSET and VTT supplied test problems concerned with 
the structural design of electromechanical film (EMF) flat loudspeakers 
and with their optimal placement in a room. The project also included 
the development of new optimisation techniques for some of the 
applications involved. 

There were three major stages in the DECISION project: 

1 Specification and design 

2 Development of optimisers, including traditional deterministic 
algorithms, gradient-based algorithms, non-smooth algorithms, 
multicriteria and stochastic algorithms. 

3 Development of the optimisation platform within the industrial 
problem areas, followed through by the development of commercial 
products. 

Thanks to the tools within IRIS Explorer which allow the creation of 
new data types and modules it is relatively straightforward to make any 
of the simulation software into a module of IRIS Explorer and connect 
to any of the optimisers within the DEEP framework. 

The DEEP platform has at its core one module - the Controller mod- 
ule. This acts as a front-end to any of the optimisation algorithms and 
is provided for the non-specialist end-user to run a previously defined 
optimisation process. The optimisation modules themselves have been 
designed such that the more sophisticated user can access them directly, 
bypassing the Controller module. 

The optimisation algorithms incorporated in the system have a variety 
of requirements in terms of input data and consequent results. In order 
to create a plug-and-play environment for the optimisers it was necessary 
to define a minimum set of inputs and outputs for the IRIS Explorer 
module environment. Interface software to meet this specification was 
then created for existing optimisers. 
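
A minimum common interface of this kind can be sketched as below. The text does not list the actual inputs and outputs chosen for DEEP, so the Optimiser protocol, the minimise signature and the controller function are assumptions. CoordinateSearch is a deliberately simple stand-in optimiser and is not one of the algorithms developed in the project.

```python
from typing import Callable, List, Protocol, Sequence, Tuple

class Optimiser(Protocol):
    """Hypothetical minimum interface: objective and start in, point and value out."""

    def minimise(self, objective: Callable[[Sequence[float]], float],
                 start: Sequence[float]) -> Tuple[List[float], float]:
        ...

class CoordinateSearch:
    """A deliberately simple derivative-free optimiser used as a stand-in."""

    def __init__(self, step=1.0, tol=1e-6):
        self.step, self.tol = step, tol

    def minimise(self, objective, start):
        x, fx, step = list(start), objective(start), self.step
        while step > self.tol:
            improved = False
            for i in range(len(x)):
                for delta in (step, -step):
                    trial = x[:i] + [x[i] + delta] + x[i + 1:]
                    ft = objective(trial)
                    if ft < fx:
                        x, fx, improved = trial, ft, True
            if not improved:
                step /= 2.0  # no better neighbour: refine the step
        return x, fx

def controller(optimiser: Optimiser, objective, start):
    """Front end: runs whichever optimiser has been plugged in."""
    return optimiser.minimise(objective, start)

best_x, best_f = controller(
    CoordinateSearch(),
    lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2,
    [0.0, 0.0],
)
```

Any optimiser that offers the same method can be passed to the controller unchanged, which is the plug-and-play property the paragraph describes.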

There is often a need with optimisation functions for information 
regarding the objective function or constraints to be provided by 
user-defined functions. In programming languages such as FORTRAN, C or 
C++ such a function may be provided as an input parameter to the 
optimisation routine or, alternatively, the optimisation routine may halt 
execution and return to the caller when it requires such information. 
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Figure 2 A Decision Map 



The latter method, usually referred to as reverse communication, 
provides more flexibility to the user, whilst the former, usually 
referred to as call-back, is perhaps more intuitive and efficient. Within 
the DEEP platform it was important to retain the efficiencies of the 
call-back system but also, in order to interface with integration 
systems such as IRIS Explorer, to provide the extra functionality of 
reverse communication. Hence a library was created to help programmers 
write a function in the call-back model but easily create a 
reverse-communication interface to it. 
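
The text does not describe how the DEEP library performs this conversion. The following sketch shows one possible mechanism, assumed here purely for illustration: the call-back style solver runs in a worker thread, and each call to its objective is turned into a request that the caller answers. The class ReverseCommunication, its methods, and the golden-section solver are hypothetical names, not part of DEEP or IRIS Explorer.

```python
import math
import queue
import threading

def golden_section(f, lo, hi, tol=1e-6):
    """A call-back style solver: it calls f whenever it needs a value."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2.0

class ReverseCommunication:
    """Expose a call-back style solver through a reverse-communication interface."""

    def __init__(self, solver):
        self._requests = queue.Queue()
        self._replies = queue.Queue()

        def objective(x):
            self._requests.put(("evaluate", x))  # halt and ask the caller
            return self._replies.get()           # resume with the reply

        def run():
            self._requests.put(("done", solver(objective)))

        threading.Thread(target=run, daemon=True).start()

    def next_request(self):
        return self._requests.get()

    def supply(self, value):
        self._replies.put(value)

# The caller owns the loop and evaluates the objective itself.
rc = ReverseCommunication(lambda f: golden_section(f, 0.0, 5.0))
while True:
    kind, payload = rc.next_request()
    if kind == "done":
        minimiser = payload
        break
    rc.supply((payload - 2.0) ** 2)
```

The solver itself is written once, in the call-back model, while the loop at the end is what a data flow environment would see: a sequence of requests for function values followed by a final result.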

As with any application development environment it was also neces- 
sary to ensure the system was able to integrate with other systems or 
environments. In particular in this case one of the solvers that was to be 
used in an application was written in Matlab. Rather than rewrite the 
solver in another programming language, a module, ToMatlab, was 
developed which would accept any solver written in Matlab as an objective 
or constraint function. 
Figure 3 The Control panel and Graphics of the Decision Map 



Figures 2 and 3 illustrate a typical map and the user interface to a 
particular problem. Once the user chooses the optimiser that should 
be used, the map is automatically configured for that choice. 

3.3. THE JULIUS PROJECT 

The focus of the Julius project [ref] was the creation of an integrated 
environment for multidisciplinary engineering simulation, with applica- 
tions in the aerospace, automotive and manufacturing sectors. Again 
there were industrial partners with large applications and a variety of 
algorithmic and other needs. The software pieces involved in the in- 
tegrated environment included CAD input and repair, a range of 3D 
mesh generation capabilities, facilities for integrating numerical simula- 
tion packages, data extraction and visualization, data base integration, 
parallel tools and resource scheduling. The simulations and computations 
took place on a variety of compute platforms from PCs to 
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high-performance parallel systems. See figure 4 for an illustration of the 
Julius framework. 





Figure 4 The Julius framework 

In Figure 4 the various JULIUS components are wrapped as CORBA 
objects. The front-end of any parallel component and the database 
(MEM-COM) are also wrapped as CORBA objects. IRIS Explorer runs 
on the local machine and provides the visual programming front-end, 
maintaining the workflow or the metaprogramming sequence. For each 
JULIUS component there is a corresponding IRIS Explorer module on 
the local machine. When a module is placed in the map editor its user 
function binds to the relevant CORBA object. Data objects may also 
be wrapped as CORBA objects and their object references may be com- 
municated through the IRIS Explorer input and output ports. 

The architecture allows a parallel application to employ MPI exclu- 
sively within its computationally intensive code. Its CORBA-wrapped 
front end allows it to be integrated with the other JULIUS tools. 
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The MEM-COM front end is a CORBA object providing methods to 
store and retrieve CORBA data objects from within the database. These 
methods may be invoked directly by other CORBA-wrapped tools within 
the system. 

The advantage of this system is that an application developer can 
sit at a Linux PC and build and run large-scale applications efficiently 
on a mixture of computing resources. The workflow sequences can 
be assembled, modified, stored and retrieved using IRIS Explorer 
visual programming. Using widgets in the IRIS Explorer modules it is 
possible to implement computational steering of each component. At 
any stage in the simulation, results can be returned to the PC where 
the visualisation capabilities of IRIS Explorer come into their own. 
The data extraction components can be used to significantly reduce 
the amount of data to be passed to the visualization modules, thereby 
increasing the efficiency. 



4. CONCLUSIONS 

IRIS Explorer is proving to be a very effective environment for 
application development. The European projects discussed above have 
provided a good deal of feedback from which we have been able to im- 
prove IRIS Explorer with application development in mind. It has been 
clear from these studies that the visual programming interface and the 
ability to design the module interfaces to meet the needs of the end-user 
are extremely beneficial, extending the usability of the environment 
from sophisticated programmers to a less programming-aware 
audience. 

Some of the improvements to be made available in future releases 
of IRIS Explorer include the ability to create a single module out of a 
group of modules - this was very much a requirement from the Stable 
project where the applications developed were highly complex and 
required O(200) modules. This enhancement to the system will improve 
the efficiency of the application and also the time to load the map into 
memory as the application begins. 

The DEEP environment of the DECISION project identified another 
area where IRIS Explorer could be improved - in the area of loops in 
maps. In the optimization algorithm maps, such as in figure 2, execution 
of the loops within the map needs to be very efficient. Once this issue had 
been identified in the project, this feature in IRIS Explorer was refined, 
hence again making a more efficient system for applications with this 
requirement. 
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Both the Stable and Decision projects have led to software compo- 
nents that can be commercially exploited but beyond that they have led 
to tools and experience that can be applied in building such environ- 
ments in a broader context. 

Clearly the Julius project was very ambitious, involving the integration 
of software components from a wide range of diverse sources. It is 
not surprising that the JULIUS consortium found the task of defining 
an architecture suitable for the needs of all industrial partners and 
potential end-users particularly challenging. The result of this was that 
prototype versions of the software were significantly delayed. In its role 
as exploitation channel for the project NAG found it increasingly diffi- 
cult to maintain a viable exploitation plan for the software deliverables, 
and eventually chose to withdraw from the project. Despite this we con- 
sider the experience gained in integrating IRIS Explorer with CORBA, 
and with visualization of very large data-sets, valuable. The project also 
showed that the IRIS Explorer visual programming model has much to 
offer problem solving environments in this area. 

The task of integrating many varied software components while main- 
taining efficiency and attempting to satisfy the sometimes diverse re- 
quirements of end-users is clearly very challenging. NAG is continuing 
developments in this area to consider a broader provision of algorithms 
and tools within the IRIS Explorer framework, to provide an integrated, 
open environment for a wide range of application areas. 
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DISCUSSION 

Speaker: Anne Trefethen 

Morven Gentleman : For the re-engineering of Genstat to produce 
STABLE, how much did you have to rely on program understanding tools 
and techniques, and how much were you able to exploit understanding 
and advice from the current maintenance staff or original developers 
(such as John Nelder) from Rothamsted? 

Anne Trefethen : The re-engineering work of Genstat was largely com- 
pleted by the Genstat developers at Rothamsted who were collaborators 
in the STABLE project. Genstat is developed by a team that evolves in 
the same way as the program itself. Some of the original developers are 
still working on Genstat, while others have handed their responsibilities 
on to the next generation. So, for example, when John Nelder retired 
in 1985, he passed on the leadership of the Genstat project to Roger 
Payne (who worked on the STABLE project). Nevertheless, the current 
team try to keep close links with their predecessors. John Nelder, in 
fact, remains a very keen Genstat user, and has been very interested in 
and impressed by the new framework that STABLE is providing. [The 
authors would like to thank Roger Payne for his input to the details of 
the answer provided here.] 

Richard Fateman : Has Axiom been considered as a software compo- 
nent in these or other projects? What’s happening with Axiom? 

Anne Trefethen : Axiom was not considered in the projects discussed 
here but has been involved in other such projects. In particular Ax- 
iom was part of the FRISCO project (Framework for Integrated Sym- 
bolic/Numeric Computation) a project funded by the European Com- 
mission under the Esprit Reactive LTR Scheme (project No. 21.024). 
NAG is in the process of creating what will be the final release of the 
Axiom product which will be released with support for a limited period. 

Morven Gentleman : You said that the basic module in the software 
architecture that IRIS Explorer can express is single input/single 
output. In the standard studies of software architecture such as Garlan 
and Shaw, this architectural style is very limited unless what is passed 
in the dataflow is complete databases, in which case the dataflow becomes 
meaningless. Have you found this to be a problem? 

Anne Trefethen : In my presentation I misspoke about the single 
input/single output. Although there appears to be a single pipe of 
information entering a module, that pipe may represent several inputs or 
outputs. In the Stable project, however, we did find the need to create a 
new datatype for the system to provide sufficient information about the 
statistical nature of the inputs and outputs. In a way this formed the 
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interface “glue” between the IRIS Explorer system and the underlying 
Genstat algorithms. 

Robert van de Geijn : The use of a graphical programming language 
creates the opportunity to arrange icons to make the “picture” (that one 
would use to explain an algorithm) become the implementation. Here I 
refer to more than going back to flow-charts. 

Anne Trefethen : I agree this opportunity is a strength of a visual 
programming language, particularly within an educational context. Our 
collaborators within the STABLE project also found the two dimensional 
representation of their algorithm, as opposed to a one dimensional piece 
of code, allowed new insights into their application/algorithm. 
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SCIENTIFIC SIMULATIONS* 
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Abstract In many applications involving large scale scientific computing a mas- 
sive amount of data is generated in a typical simulation. Such simula- 
tions generally require the numerical solution of systems of differential 
equations where the results are often generated remotely using special 
high-performance software and computer systems and then examined 
and investigated interactively using visualization tools. The visualiza- 
tion packages are usually run on local workstations and make use of 
colour, lighting, texture, sound and animation to highlight and reveal 
interesting characteristics or features of the approximate solution. This 
‘interactive’ viewing of the data is the ‘rendering’ stage of the modeling 
process and it can be very selective and local in the sense that only 
a subset of the variables are rendered and then only in regions where 
something interesting is happening. The result is that, in many simula- 
tions, large amounts of data must be stored and shared in a distributed 
environment while only a very small fraction of this data will ever be 
viewed by anyone. In this paper we propose an approach, using a 
hierarchical representation of the approximate solution, which will avoid 
the generation and storage of data that is not required. 

Keywords: scientific visualization, multivariate interpolation, data compression, 
hierarchical data structures, distributed computing, ODEs, PDEs. 

1. INTRODUCTION 

In scientific simulations results are often represented by approximate 
solutions to large systems of differential equations - either partial dif- 
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ferential equations (PDEs) or ordinary differential equations (ODEs). 
Such simulations arise in all areas of science and engineering. With the 
dramatic increase in computing power and the development of more ac- 
curate and realistic mathematical models there has been an evolution in 
how the results of a simulation can be viewed and/or interpreted. 

It is typically the case that an underlying numerical method will de- 
termine a discrete approximation to the solution on an adaptive unstruc- 
tured mesh. While tables of the numerical values associated with the 
meshpoints can be used to represent the discrete solution, such tables are 
difficult to interpret or understand. A research area has been developed 
to address the question of how to present and display results in the most 
effective way. This area, scientific visualization, has introduced color, 
sound and animation together with other computer graphics techniques 
to effectively display key properties of the results of large scale simu- 
lations. In order to use these techniques effectively the property to be 
rendered must be represented on a fine enough mesh to match the 
resolution of the hardware display device being used (typically a computer 
monitor). It must be emphasized that this resolution is independent 
of the underlying discrete mesh which is a property of the underlying 
numerical method and the accuracy that is specified in the simulation. 
The accuracy of the simulation affects the number of meshpoints and 
the total computer time required to generate the approximate solution 
on the initial (coarse) mesh. The resolution of the renderer will impose 
a constraint on how fine the ‘refined’ mesh must be in order to avoid 
distracting artifacts in the displayed images. 

The key idea we advocate in this paper is the use of multivariate 
piecewise polynomial interpolants to represent the results of scientific 
simulations. These piecewise polynomials are defined on the coarse ini- 
tial mesh, interpolating the discrete data, and they can be evaluated 
during the rendering process to deliver a discrete approximation to the 
solution on an arbitrary mesh (that is consistent with the initial mesh). 
It is crucial to develop a language-independent representation of such
piecewise polynomials so that different application environments can access
and directly manipulate or extend these results.

We propose the use of a paradigm where the ‘approximate’ solution
associated with a simulation is defined in terms of an approximation on a
hierarchical collection of associated consistent meshes. An underlying
numerical method explicitly produces a discrete approximate solution
on a relatively coarse initial mesh. This initial approximation, together
with a generic procedure for ‘extending’ this approximation to any mesh
that is a refinement of the initial mesh, defines the ‘general’ approximate
solution. The key implication is that the data corresponding to a refined
mesh (associated with the general approximate solution) need not be
explicitly determined until it is needed since only the data associated
with the coarse mesh is necessary to represent or characterize the general
approximate solution.
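The deferred-evaluation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `GeneralSolution`, its `refine` method and the placeholder rule `linear_rule` are hypothetical and do not come from the paper, and a one-dimensional mesh is used for brevity.

```python
from typing import Callable, Dict, List, Tuple

LocalRule = Callable[[float, float, float, float, float], float]


class GeneralSolution:
    """Coarse-mesh data plus a local rule that extends it to any refinement."""

    def __init__(self, mesh: List[float], values: List[float], extend: LocalRule):
        self.mesh = mesh        # coarse mesh x_0 < x_1 < ... < x_N
        self.values = values    # discrete solution on the coarse mesh
        self.extend = extend    # local rule: (x, xl, xr, yl, yr) -> value at x
        self._cache: Dict[Tuple[int, int], List[float]] = {}

    def refine(self, element: int, k: int) -> List[float]:
        """Values at k + 1 equally spaced points of one element, built on demand."""
        key = (element, k)
        if key not in self._cache:
            xl, xr = self.mesh[element], self.mesh[element + 1]
            yl, yr = self.values[element], self.values[element + 1]
            self._cache[key] = [
                self.extend(xl + (xr - xl) * j / k, xl, xr, yl, yr)
                for j in range(k + 1)
            ]
        return self._cache[key]

    def refined_values_stored(self) -> int:
        """Number of refined values that have actually been generated so far."""
        return sum(len(v) for v in self._cache.values())


def linear_rule(x: float, xl: float, xr: float, yl: float, yr: float) -> float:
    """Placeholder local rule; a higher-order interpolant would replace it."""
    return yl + (yr - yl) * (x - xl) / (xr - xl)


solution = GeneralSolution([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 4.0, 9.0], linear_rule)
zoomed = solution.refine(element=1, k=4)   # the viewer looks only at [1, 2]
```

Only the element that is actually viewed is refined; the other elements contribute nothing to storage until they are requested.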

Examples of this approach arising from models using ordinary and 
partial differential equations are presented to illustrate the significant 
reductions in both storage requirements and computer time that can be 
achieved. We show that speed up factors and storage reduction factors 
of between 10 and 1000 can be realised and that this approach can 
be particularly effective when working with 2D and 3D mathematical 
models. 



1.1. BACKGROUND MOTIVATION 

The hierarchical data structure associated with our approach can
adapt dynamically to reflect the changing demands of the interactive
rendering process. In particular, in applications involving ‘data mining’
or ‘feature discovery’ the renderer will be ‘steered’ by a user to focus on
particular subsets of the data on parts of the domain of interest.

An efficient implementation of this approach will involve a careful
matching of the data structure used to represent the general numerical
solution to the hierarchical hardware that is available to store the
information. The approach we develop makes use of local information
only (in the refinement step) and is therefore suitable for use in a paral- 
lel environment. That is, the polynomial associated with each element 
of the coarse mesh can be determined and rendered independently and 
without synchronization. 

Although the examples we introduce and present in this paper have 
been implemented in MATLAB (primarily since it is an effective envi- 
ronment for rapid development of prototype ‘proof of concept’ systems) 
more efficient object-oriented systems are under development in C++. 

1.2. OUTLINE OF PRESENTATION 

In the next section we will describe the approach for one-dimensional
problems (ODEs) and show how recent research in the development of
‘continuous’ numerical methods for ODEs can be interpreted as a canonical
example of a general numerical solution. This general numerical
solution will be represented by a univariate piecewise polynomial defined
on the coarse mesh (associated with the underlying discrete
approximation). The piecewise polynomial can be evaluated at any point in
the interval of interest. In particular it can be used to approximate the
solution on any mesh that is a refinement of the original coarse mesh.

We will then, in section 3, show how this approach generalizes to
two- and three-dimensional problems (PDEs), where multivariate
piecewise polynomials are associated with the coarse mesh. These piecewise
polynomials then represent the general solution and they can be used
to approximate the solution at any point in the domain of interest (in
particular, at any point that arises during the rendering process). We
present detailed results for two example PDE problems which illustrate
and quantify some of the advantages and limitations of this approach.
These examples reveal the typical trade-offs and relationships that are
observed between the resolution of the final visualization and the coarseness
or accuracy of the general solution.

We will conclude the paper with a discussion of current ongoing re- 
lated work. We will also discuss the implications for high level problem 
solving environments where this approach can be used effectively pro- 
vided one adopts a generic language-independent representation for the 
problem specification and the general approximate solution. 

2. ODES: THE ‘EASY’ CASE: 

2.1. THE GENERIC APPROACH IN ONE 
DIMENSION 

Consider the first-order system of ODEs

$$ y' = f(x, y). $$

A $p$th-order, $s$-stage RK method determines

$$ y_i = y_{i-1} + h \sum_{j=1}^{s} \omega_j k_j, $$

where

$$ k_j = f\Bigl(x_{i-1} + h c_j,\; y_{i-1} + h \sum_{r=1}^{s} a_{jr} k_r\Bigr). $$

A continuous extension (CRK) is determined by adding extra stages to
obtain an order $p$ approximation for $x \in (x_{i-1}, x_i)$,

$$ u_i(x) = y_{i-1} + h \sum_{j=1}^{\bar{s}} b_j\!\left(\frac{x - x_{i-1}}{h}\right) k_j, $$

where $b_j(\tau)$ is a polynomial of degree $p$.

That is, one determines $k_{s+1}, k_{s+2}, \ldots, k_{\bar{s}}$ and polynomials $b_j(\tau)$ to
ensure:

$$ u_i(x_i) = y_i, $$

$$ u_i'(x_i) = f(x_i, y_i), $$

$$ u_i(x) = y(x) + O(h^p), \quad x \in [x_{i-1}, x_i]. $$

Collectively the $u_i(x)$ define a piecewise polynomial approximation
to $y(x)$ that is $O(h^p)$ accurate for $x \in [x_0, x_F]$, on an associated mesh
$x_0 < x_1 < \cdots < x_N = x_F$. A refinement of this coarse mesh can be defined
by introducing $K$ equally spaced points per subinterval, and
one can evaluate the piecewise polynomial on this fine mesh to obtain a
discrete approximation that is more appropriate for rendering.
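The refinement step can be illustrated with a small Python sketch. It is not the CRK formula of any particular solver: as a stand-in, it uses the classical fourth-order RK method together with a cubic Hermite interpolant built from the step endpoints and the slopes there, which is also purely local to each step. The function names `rk4_step`, `hermite` and `refine_solution` are assumptions made for the example.

```python
import math
from typing import Callable, List, Tuple

Rhs = Callable[[float, float], float]


def rk4_step(f: Rhs, x: float, y: float, h: float) -> float:
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = f(x, y)
    k2 = f(x + h / 2, y + h / 2 * k1)
    k3 = f(x + h / 2, y + h / 2 * k2)
    k4 = f(x + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def hermite(tau: float, h: float, y0: float, y1: float, f0: float, f1: float) -> float:
    """Cubic matching the values y0, y1 and slopes f0, f1 at tau = 0 and tau = 1."""
    h00 = 2 * tau**3 - 3 * tau**2 + 1
    h10 = tau**3 - 2 * tau**2 + tau
    h01 = -2 * tau**3 + 3 * tau**2
    h11 = tau**3 - tau**2
    return h00 * y0 + h10 * h * f0 + h01 * y1 + h11 * h * f1


def refine_solution(f: Rhs, x0: float, y0: float, x_end: float,
                    n_steps: int, K: int) -> Tuple[List[Tuple[float, float]],
                                                   List[Tuple[float, float]]]:
    """Return the coarse solution and its refinement with K points per step."""
    h = (x_end - x0) / n_steps
    coarse, fine = [(x0, y0)], [(x0, y0)]
    x, y = x0, y0
    for i in range(n_steps):
        x_next = x0 + (i + 1) * h
        y_next = rk4_step(f, x, y, h)
        f_left, f_right = f(x, y), f(x_next, y_next)
        for j in range(1, K + 1):
            tau = j / K
            fine.append((x + tau * h, hermite(tau, h, y, y_next, f_left, f_right)))
        coarse.append((x_next, y_next))
        x, y = x_next, y_next
    return coarse, fine


# Test problem y' = -y, y(0) = 1, with exact solution exp(-x).
coarse, fine = refine_solution(lambda x, y: -y, 0.0, 1.0, 2.0, n_steps=8, K=5)
max_error = max(abs(u - math.exp(-x)) for x, u in fine)
```

The fine mesh has five times as many points as the coarse one, yet it is generated entirely from the coarse data and the differential equation, one step at a time.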

2.2. A ONE DIMENSIONAL EXAMPLE 

A predator-prey relationship can be modeled by the IVP:

$$ y_1' = y_1 - 0.1\, y_1 y_2 + 0.02\, x, $$

$$ y_2' = -y_2 + 0.02\, y_1 y_2 + 0.008\, x, $$

with

$$ y_1(0) = 30, \quad y_2(0) = 20, \quad x \in [0, 40], $$

where $y_1(x)$ represents the ‘prey’ population at time $x$ and $y_2(x)$
represents the ‘predator’ population at time $x$. The solution can then be
visualized using a standard $y$ versus $x$ solution plot or using a ‘phase plane’
plot. Consider the use of a standard ODE solver such as ode45 of MATLAB,
which is based on a fourth order CRK formula.
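For readers who want to reproduce the example outside MATLAB, the following Python sketch sets up the same IVP. It uses a fixed-step classical RK4 integrator rather than the adaptive formula pair of ode45, so the mesh it produces differs from the one discussed here; the function names and the step count are assumptions of the sketch.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]


def predator_prey(x: float, y: State) -> State:
    """Right-hand side of the predator-prey system (y1 = prey, y2 = predator)."""
    y1, y2 = y
    return (y1 - 0.1 * y1 * y2 + 0.02 * x,
            -y2 + 0.02 * y1 * y2 + 0.008 * x)


def rk4(f: Callable[[float, State], State], x0: float, y0: State,
        x_end: float, n: int) -> Tuple[List[float], List[State]]:
    """Fixed-step classical RK4; returns the mesh and the discrete solution."""
    h = (x_end - x0) / n
    xs, ys = [x0], [y0]
    x, y = x0, y0
    for i in range(n):
        k1 = f(x, y)
        k2 = f(x + h / 2, (y[0] + h / 2 * k1[0], y[1] + h / 2 * k1[1]))
        k3 = f(x + h / 2, (y[0] + h / 2 * k2[0], y[1] + h / 2 * k2[1]))
        k4 = f(x + h, (y[0] + h * k3[0], y[1] + h * k3[1]))
        y = (y[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
             y[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))
        x = x0 + (i + 1) * h
        xs.append(x)
        ys.append(y)
    return xs, ys


xs, ys = rk4(predator_prey, 0.0, (30.0, 20.0), 40.0, 4000)
prey = [y[0] for y in ys]        # plot against xs for the solution plot
predator = [y[1] for y in ys]    # plot against prey for the phase plane plot
```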

Figure 1 shows the results when one attempts to display the solution 
using the discrete solution only. In this case the simulation presented no 
numerical difficulties and ode45 determined an accurate approximation 
with fewer than 50 time steps. Direct rendering of this discrete solu- 
tion gives a distorted impression of the ‘shape’ of the solution and the 
behaviour of the phase-plane ‘portrait’ of the solution. 

For one-dimensional visualization, the rendering process can some- 
times use cubic splines to present a more realistic or less distracting 
view of the results. For example the cubic spline that interpolates the 
discrete values produced by ode45 can be evaluated on a fine mesh (we 
have used a fine mesh obtained by subdividing each of the coarse-mesh 
intervals associated with the discrete solution into 20 equally spaced
subintervals). Figure 2 illustrates this technique for rendering the discrete
solution generated by ode45. This is definitely an improvement but the
fixed low order of this technique limits its effectiveness as a general
solution. Note that while the solution plots are realistic and the phase plane
portraits are less distracting they still contain unrealistic features that
are not present in the true solution.




Figure 1 Visualization Using Piecewise Linear Interpolation: (a) Solution Plot, (b) Phase Plane Plot.

Figure 2 Visualization Using Spline Interpolation: (a) Solution Plot, (b) Phase Plane Plot.



The special interpolant used to generate the visualizations presented
in Figure 3 is associated with the continuous Runge-Kutta formula
that is the basis of the method ode45. We have evaluated the underlying
piecewise polynomial at the same fine mesh that we used with the cubic
spline visualization to obtain these results. Note that, in particular, the
off-mesh accuracy of this latter piecewise polynomial leads to a much
more realistic phase plane portrait.

Figure 3 Visualization Using Special Interpolation: (a) Solution Plot, (b) Phase Plane Plot.

3. APPLICATION TO PDES 

3.1. THE GENERIC APPROACH IN TWO 
DIMENSIONS AND HIGHER: 

Assume the PDE is a 2D, second-order problem:

$$ Lu = g(x, y, u, u_x, u_y), \tag{1} $$

where

$$ Lu = a_1(x, y)\, u_{xx} + a_2(x, y)\, u_{yy} + a_3(x, y)\, u_{xy}. $$

Assume also that an underlying method has generated a discrete
approximation to $u$, $u_x$ and $u_y$ on a mesh consisting of rectangular
elements. The approximations $U$, $U_x$ and $U_y$ associated with each
meshpoint, $(x, y)$, satisfy

$$ u(x, y) - U = O(h^{p_1}), $$

$$ u_x(x, y) - U_x = O(h^{p_2}), \tag{2} $$

$$ u_y(x, y) - U_y = O(h^{p_2}), $$

where $h$ is the maximum mesh spacing in either $x$ or $y$. Note that with
this assumption we have $O(h^2)$ accuracy at off-mesh points if we use
piecewise linear interpolation of the mesh data. For each element, $e$, we
determine a bivariate polynomial, $p_{d,e}(x, y)$, of degree $d$ that

■ interpolates the mesh data (12 constraints)

■ almost satisfies (1) at $m$ collocation points

■ provides optimum accuracy at off-mesh points

The best we can expect is, for $(x, y) \in e$,

$$ u(x, y) - p_{d,e}(x, y) = O(h^{p}), $$

$$ u_x(x, y) - \frac{\partial p_{d,e}}{\partial x}(x, y) = O(h^{p-1}), \tag{3} $$

$$ u_y(x, y) - \frac{\partial p_{d,e}}{\partial y}(x, y) = O(h^{p-1}), $$

where $p = \min\{p_1, p_2 + 1, d + 1\}$.

For a detailed discussion and justification of this approach see [1]. 
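The order relation stated above is easy to tabulate. The short Python sketch below evaluates $p = \min\{p_1, p_2 + 1, d + 1\}$ for a few combinations; the function name `offmesh_order` and the sample orders are illustrative assumptions, not values reported in the paper.

```python
def offmesh_order(p1: int, p2: int, d: int) -> int:
    """Best attainable off-mesh order p = min(p1, p2 + 1, d + 1)."""
    return min(p1, p2 + 1, d + 1)


# Mesh values of order 4, mesh derivatives of order 3 (hypothetical orders).
linear_order = offmesh_order(p1=4, p2=3, d=1)    # limited by the degree: d + 1
cubic_order = offmesh_order(p1=4, p2=3, d=3)     # matches the mesh accuracy
quintic_order = offmesh_order(p1=4, p2=3, d=5)   # a higher degree no longer helps

# Halving h is then expected to reduce the off-mesh error by about 2**p.
expected_gain_cubic = 2 ** cubic_order
```

With $d = 1$ the formula gives order 2, in agreement with the remark on piecewise linear interpolation; raising the degree beyond the accuracy of the mesh data brings no further gain.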

3.2. A TWO DIMENSIONAL EXAMPLE: 

Consider the following PDE from the ELLPACK collection [3]: 
y>xx = cos{ny)u - (1 -h sin(7ra:))ttj; -f- f{x, y) 
on the domain

$$ 0 < x < 1, \quad 0 < y < 1, $$

with boundary conditions

$$ u(x, y) = \cos(By) + \sin B(x - y), $$

where $B = \pi$ or $B = 10$ and $f(x, y)$ is defined to ensure that the true
solution agrees with that specified on the boundary for the whole domain.

For this PDE defined by $B = 10$, visualizing with an 8 x 8 coarse
mesh (discrete solution only), and no refinement for rendering, we
obtain the results presented in Figure 4. Note that the off-mesh values
required in this rendering are generated using bivariate piecewise linear
interpolation. Clearly these results are distracting and not satisfactory.

When our approach is applied to improve the resolution of this discrete 
solution we can use a piecewise cubic evaluated at the refined mesh 
defined by subdividing each of the 64 elements of the discrete solution 
with a 4x4 uniform partitioning. The resulting visualization is presented 
in Figure 5 where the improved resolution has resulted in a more realistic 
and less distracting display of the solution. 

Even better resolution is possible by evaluating the same piecewise 
cubic on a finer mesh prior to rendering. When a 16 x 16 partitioning 
of each of the 64 elements is used one obtains the results presented in 
Figure 6 where the lighting effects and contour lines are now much more 
natural. 
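The sizes of the refined meshes used in these two renderings follow directly from the partitioning. The Python sketch below builds the refined coordinates along one axis and counts points and cells; `refine_axis` is a hypothetical helper, and a uniform coarse mesh on the unit square is assumed.

```python
from typing import Dict, List, Tuple


def refine_axis(points: List[float], k: int) -> List[float]:
    """Split every interval of a 1D mesh into k equal subintervals."""
    fine: List[float] = []
    for a, b in zip(points, points[1:]):
        fine.extend(a + (b - a) * j / k for j in range(k))
    fine.append(points[-1])
    return fine


coarse_axis = [i / 8 for i in range(9)]      # 8 x 8 coarse mesh on the unit square
counts: Dict[int, Tuple[int, int]] = {}
for k in (4, 16):
    axis = refine_axis(coarse_axis, k)
    cells_per_side = len(axis) - 1
    # (number of refined mesh points, number of refined cells) in 2D
    counts[k] = (len(axis) ** 2, cells_per_side ** 2)
```

The 4 x 4 partitioning yields 1024 refined cells and the 16 x 16 partitioning yields 16384, while the stored discrete solution still belongs to the 64 coarse elements.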

To illustrate the effects of using too coarse a mesh for the discrete
solution we show in Figure 7 the results corresponding to an accurate
discrete solution defined on a 4 x 4 coarse mesh with the associated
piecewise cubic evaluated on a fine mesh obtained by a uniform 32 x 32
partitioning of each of the 16 elements. In this case the ‘ripples’ in the
surface plots and the distortions evident in the contour lines reflect the
influence of the underlying coarse mesh.

Figure 4 Visualization Using Piecewise Linear Interpolation: (a) Surface Plot, (b) Contour Plot.

Figure 5 Visualization Using Piecewise Polynomial on 8 x 8 mesh with 4 x 4 refinement: (a) Surface Plot, (b) Contour Plot.

Figure 6 Visualization Using Piecewise Polynomial on 8 x 8 mesh with 16 x 16 refinement: (a) Surface Plot, (b) Contour Plot.

Figure 7 Visualization Using Piecewise Polynomial with 4 x 4 mesh and 32 x 32 refinement: (a) Surface Plot, (b) Contour Plot.



3.3. A THREE DIMENSIONAL EXAMPLE: 

Consider the wave equation with two spatial dimensions describing a 
clamped vibrating membrane: 

"I” liyy') — O 5 

on the domain: $0 \le t \le 2$, $0 \le x \le 2$, $0 \le y \le 2$, with boundary
conditions: $u(t, x, y) = 0$, and initial conditions:

$$ u(0, x, y) = 0.1 \sin(\pi x) \sin(\pi y / 2), \quad u_t(0, x, y) = 0. $$

To display the solution to this problem one can use ‘animation’ and
generate a movie representation of the solution by determining
‘snapshots’ of the solution for various values of $t \ge 0$. For example, 100
values of $t$ in the range $0 \le t \le 2$ give a smooth animation that
effectively displays the motion of the membrane as it passes through one
oscillation. This realistic visualization would then require the generation
of 100 snapshots. We will use snapshots at $t = 0$, $t = 1.34$ and
$t = 2.0$ to illustrate the typical resolution of this way of displaying the
results. For example, if we use our approach with a 10 x 10 x 10 coarse
mesh and a uniform 10 x 10 x 10 partitioning of each element to define
the refined mesh we can use a tricubic piecewise polynomial to obtain
the results presented in Figures 8, 9 and 10. Note that the complete
animation (100 snapshots - each comprised of a surface or contour plot
involving the solution approximated at 10000 points) is represented by
the discrete values associated with the coarse mesh (consisting of only
1000 elements). This corresponds to a ‘compression’ factor of 1000,
with respect to storage.
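The arithmetic behind this factor can be written out explicitly. The Python lines below follow the text in counting rendered values against coarse elements; the number of values actually stored per coarse element is not stated here, so the figure should be read as an order-of-magnitude estimate. The variable names are illustrative.

```python
snapshots = 100                         # values of t used for the animation
points_per_snapshot = 100 * 100         # 10 x 10 elements in (x, y), each split 10 x 10
rendered_values = snapshots * points_per_snapshot

coarse_elements = 10 * 10 * 10          # 10 x 10 x 10 coarse mesh in (t, x, y)
compression_factor = rendered_values // coarse_elements
```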

At this resolution some artifacts of the coarse mesh are visible and
may be distracting. To obtain better visualization one could increase
the degree of the piecewise polynomial or use a different initial coarse
mesh. For example, Figures 11, 12 and 13 show the results corresponding
to the use of tricubics on this problem with a 20 x 20 x 20 coarse mesh
and a uniform 5 x 5 x 5 partitioning of each element to define the refined
mesh.





Figure 8 Visualization Using Piecewise Polynomial on 10 x 10 x 10 mesh with 10 x 10 x 10 refinement at t = 0: (a) Surface Plot, (b) Contour Plot.

Figure 9 Visualization Using Piecewise Polynomial on 10 x 10 x 10 mesh with 10 x 10 x 10 refinement at t = 1.34: (a) Surface Plot, (b) Contour Plot.

Figure 10 Visualization Using Piecewise Polynomial on 10 x 10 x 10 mesh with 10 x 10 x 10 refinement at t = 2.0: (a) Surface Plot, (b) Contour Plot.



4. EXTENSIONS AND IMPLICATIONS 

4.1. EXTENSIONS AND RELATED 
APPLICATIONS 

Although, in our examples involving PDEs, we have focused on tensor
product meshes with multivariate cubic interpolation, the approach can
be applied to unstructured triangular (or tetrahedral) meshes and higher
degree multivariate interpolation. (See [1] for a detailed discussion of the
general applicability of the approach.) The underlying piecewise
polynomial can also be directly exploited for error estimation or discontinuity
detection and these extensions are being investigated at this time.

Figure 11 Visualization Using Piecewise Polynomial on 20 x 20 x 20 mesh with 5 x 5 x 5 refinement at t = 0: (a) Surface Plot, (b) Contour Plot.

Figure 12 Visualization Using Piecewise Polynomial on 20 x 20 x 20 mesh with 5 x 5 x 5 refinement at t = 1.34: (a) Surface Plot, (b) Contour Plot.

In the rendering application we have considered in this paper, as well as
in other applications discussed elsewhere, it is critical that the data
structure used to store the general approximate solution be carefully chosen
and available for use in a multi-language environment. Typical
operations or transformations that can arise and which should be efficiently
implemented include the multipoint evaluation of the general approximate
solution, the location of extreme values of the general approximate
solution and the checking of whether two different approximate solutions
are ‘close’ to each other. The computer language most appropriate for
implementing such operations will vary but a well designed data
structure can be effectively shared by all such applications.

Figure 13 Visualization Using Piecewise Polynomial on 20 x 20 x 20 mesh with 5 x 5 x 5 refinement at t = 2.0: (a) Surface Plot, (b) Contour Plot.

In order to carry out the numerical experiments reported in this paper 
we have adopted a recursive data structure using a rooted tree represen- 
tation of the general approximate solution and implemented it in such 
a way that it could be accessed and manipulated using Fortran, Matlab 
or C. 
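A rooted-tree representation of this kind might look as follows in Python. The sketch is one-dimensional and hypothetical throughout: the `Node` layout, the coefficient convention (powers of the local variable `x - lo`) and the helpers `evaluate_many` and `close` are assumptions chosen to illustrate two of the operations listed above, not the data structure actually used for the experiments.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Node:
    """Tree node: a leaf carries a local polynomial, an inner node carries children."""
    lo: float
    hi: float
    coeffs: Optional[List[float]] = None    # leaf: coefficients of (x - lo)**n, n = 0, 1, ...
    children: List["Node"] = field(default_factory=list)

    def evaluate(self, x: float) -> float:
        if not self.lo <= x <= self.hi:
            raise ValueError("x lies outside this node")
        for child in self.children:
            if child.lo <= x <= child.hi:
                return child.evaluate(x)    # descend to the element containing x
        if self.coeffs is None:
            raise ValueError("no polynomial covers x")
        s = x - self.lo
        result = 0.0
        for c in reversed(self.coeffs):     # Horner's rule
            result = result * s + c
        return result


def evaluate_many(root: Node, xs: Sequence[float]) -> List[float]:
    """Multipoint evaluation of the general approximate solution."""
    return [root.evaluate(x) for x in xs]


def close(a: Node, b: Node, xs: Sequence[float], tol: float) -> bool:
    """Check whether two approximate solutions agree to within tol on sample points."""
    return max(abs(a.evaluate(x) - b.evaluate(x)) for x in xs) <= tol


# The function x**2 on [0, 2], stored once as two elements and once as a single element.
two_elements = Node(0.0, 2.0, children=[Node(0.0, 1.0, [0.0, 0.0, 1.0]),
                                        Node(1.0, 2.0, [1.0, 2.0, 1.0])])
one_element = Node(0.0, 2.0, [0.0, 0.0, 1.0])
samples = evaluate_many(two_elements, [0.5, 1.5])
agree = close(two_elements, one_element, [0.0, 0.5, 1.0, 1.5, 2.0], 1e-12)
```

Because every leaf is self-contained, a structure like this can be serialized in a language-neutral form and traversed from any of the languages mentioned.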

Once one adopts such a generalized data structure other questions 
arise which require further investigation. For example we have assumed 
that the renderer must know the solution on a uniform, fine, rectangular 
mesh before it can display the solution. If one is interested for example 
in displaying contour lines one may be able to do this directly from 
the piecewise polynomial, without having to first perform a multipoint 
evaluation on a very fine uniform mesh (see [2] for an example of how
contour lines can be directly computed).

DISCUSSION

Speaker: Wayne Enright 

Morven Gentleman : You have pointed out that the discretization 
needs for solving the problem accurately and rendering it satisfactorily 
may be different. You have then examined the situation where render- 
ing requires finer resolution. This weekend we were shown a case where 
coarser resolution for display created a sampling artifact that was mis- 
leading. Have you considered this case? 

Wayne Enright : This certainly can happen. I would envisage a
collection of interpolants being available (of increasing order) where a user
could observe more accurate interpolation (for the same data) by observing
the different renderings associated with a sequence of more accurate
interpolants.

Ian Gladwell : Please explain why your approach to refinement based 
on using the differential equation is better than relying on the smoothing 
functions of the visualization software. 

Wayne Enright : Without knowledge of the differential equation 
the visualizer can only use assumptions regarding global continu- 
ity/smoothness and additional nearby data values to define an appro- 
priate piecewise polynomial. This will usually result in a non-local com- 
putation and accuracy that will depend on the size of elements in the 
neighborhood (rather than depending only on the size of the single as- 
sociated element). 




SOFTWARE ARCHITECTURE FOR THE 
INVESTIGATION OF CONTROLLABLE

Abstract Investigation of a large control system needs rather sophisticated
applications that satisfy user requirements such as: the need for
unplanned modification of all elements of the model and data set under
consideration, depending on the current results of calculations; the
opportunity to use different methods to solve the same problem, or the same
part of a problem, most effectively; the typically experimental character of
computations defined by the subject-matter task (comparisons, generation of
versions to analyze one dependency or another, etc.); and adaptation to the
user as a domain expert, not as a programmer.

Basic principles of regional modeling are proposed, together with a
software architecture for creating and processing such models.
Socio-ecologo-economical modeling of a region with optimization of its
development strategy is considered. The main topics are software
support for modeling, model processing and representation, the use of
optimizing algorithms, and the creation of optimization procedures for
regional development.

A sample software application and results of numerical experiments
based on the model of the Pereslavl region are also described.

Keywords: socio-ecologo-economical modeling, optimal control, software architec- 
ture, multimethod procedures, regional development 

1. INTRODUCTION 

Investigation of a large control system needs rather sophisticated
applications that satisfy user requirements such as:

- the need for unplanned modification of all elements of the model and data
set under consideration, depending on the current results of
calculations;

- the opportunity to use different methods to solve the same problem,
or the same part of a problem, most effectively;

- the typically experimental character of computations defined by the
subject-matter task (comparisons, generation of versions to analyze one
dependency or another, etc.);

- adaptation to the user as a domain expert, not as a programmer.

The socio-ecologo-economical model of a region with innovative sub- 
model and algorithms of optimization of development strategy (Carraro, 
Deissenberg, Gurman et al., 1999) serves as an expressive example of 
such a control system. 

The model with innovations for a region is constructed as a modification
of the general model described in (Gurman, 1981):

$$ c = (E - A)y - Bu - A^z z - B^z u^z - A^d d - B^d u^d, \tag{1} $$

$$ 0 \le y \le \Gamma(k), \quad 0 \le z \le \Gamma^z(k^z), \tag{2} $$

$$ \dot{r} = \frac{dr^*}{dt} + N(r - r^*) - Cy + C^z z + im^r - ex^r, \tag{3} $$

$$ \dot{k} = u - \delta k, \quad \dot{k}^z = u^z - \delta^z k^z. \tag{4} $$

Here $y$ is the current traditional output; $z$ is the output vector of the
environmental sector; $u$ and $u^z$ are investments in traditional capital $k$ and
in capital $k^z$ of the environmental sector; $c$ is the real consumption; $r$ is a
vector describing the state of the environmental sphere. $E$ is the identity
matrix and $A$, $B$, $C$, $D$, $A^z$, $B^z$, $C^z$ and the remaining matrices are
coefficient matrices of appropriate dimensions; $\delta$ and $\delta^z$ are diagonal
matrices of depreciation rates.

The innovative part is described by

$$ \dot{\theta} = -([d] + H_{inv} + H_{dif})(\theta - \bar{\theta}), \quad \theta(0) = 0, $$

$$ \dot{k}^d = u^d - \delta^d k^d, \quad 0 \le d \le \Gamma^d(k^d), $$

where $d$ is a control vector representing the activity of the innovative
sector, whose capital, capacity and depreciation rate are, respectively,
$k^d$, $\Gamma^d(k^d)$ and $\delta^d$, and $[d]$ is the diagonal matrix formed from the
vector $d$. The matrices $H_{inv}$, $[d]$, and $H_{dif}$ correspond to three main
sources of innovations:

1) investment, which typically implies replacing older vintage capital
with newer capital that has better characteristics;

2) specific expenditures for technology and management improvement
that have no direct capacity effect;

3) innovative diffusion processes and similar endogenous innovative
changes.

Typically, $\bar{\theta}$ may represent the best available innovative level in similar
regions.

The positive impact of the basic capital, society and environmental
quality can be accounted for by multiplying the right-hand side of (1)
by some matrix $K$ depending on the related state variables.

The innovative costs are presented in the main balance equation (1)
by $A^d d$ and $B^d u^d$.

The parameter vector $\alpha$ is defined as the vector of all stacked elements
of the model's coefficient matrices and other parameters, or their
inverses, so that "innovative improvement" means reducing the
components of $\alpha$.

The number of parameters can be very large: roughly speaking, it
is proportional to the squared number of state variables in the model.
We shall therefore in practice not work with the vector of parameters
$\alpha = (\alpha_i)$, but with a lower-dimensional, aggregated vector $\theta$. This vector
is obtained with the help of the following procedure. We first partition
the elements of the vector $\alpha$ into $m$ subsets $I_j$, $j = 1, 2, \ldots, m$. To each
subset $I_j$ corresponds one element $\theta_j$ of $\theta$. The current value of $\theta_j$ is
constructed according to:




( 5 ) 



where $\alpha_i$ is the current value of the $i$-th parameter in $I_j$; $n_j$ is the number
of parameters in $I_j$; $\Delta\alpha_i = \alpha_i - \alpha_i(0)$; and $\alpha_i(0)$ is the initial value of
$\alpha_i$. Thus, $\theta_j$ describes the mean percentage change for the subset of
parameters $I_j$.

The values of $\theta_j$ are disaggregated when required by "redistributing"
them for each $t$ among the parameters belonging to the subset $I_j$. Several
rules can be used to perform this disaggregation, in particular:

- by changing all parameters in the subset by the same percentage;

- by weighting the changes according to the level of saturation of each
given parameter, that is, its distance from its possible value boundary;

- according to empirical statistical distributions for the parameters;

- using an additional conditional optimization with respect to the
parameters subject to $\theta_j$.
For this purpose we use the following formula 

O/ij “ (f ^ ^ 

where $a_{ij}$ are distributive weight parameters which can be chosen
differently according to the distribution rules above.
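The aggregation and disaggregation steps can be illustrated with a small Python sketch. Since the two formulas themselves are not legible in this copy, the sketch rests on assumptions that should be checked against the original paper: the aggregate is taken to be the plain mean of the relative changes $\Delta\alpha_i / \alpha_i(0)$, and the weights are normalized so that their mean is one. The function names are hypothetical.

```python
from typing import List, Optional


def aggregate(alpha: List[float], alpha0: List[float]) -> float:
    """Assumed form of the aggregate: mean relative change over one subset I_j."""
    return sum((a - a0) / a0 for a, a0 in zip(alpha, alpha0)) / len(alpha)


def disaggregate(theta_j: float, alpha0: List[float],
                 weights: Optional[List[float]] = None) -> List[float]:
    """Redistribute theta_j over the subset; equal weights change all parameters
    by the same percentage (the first rule in the list above)."""
    n = len(alpha0)
    if weights is None:
        weights = [1.0] * n
    mean_weight = sum(weights) / n          # normalization is an assumption of the sketch
    return [a0 * (1.0 + theta_j * w / mean_weight) for a0, w in zip(alpha0, weights)]


alpha0 = [2.0, 4.0, 10.0]
same_percentage = disaggregate(-0.1, alpha0)                 # every parameter falls by 10%
weighted = disaggregate(-0.1, alpha0, weights=[1.0, 2.0, 3.0])
round_trip = aggregate(weighted, alpha0)                     # recovers -0.1 up to rounding
```

With this normalization any choice of weights reproduces the same aggregate value, so the weights only decide how the change is shared within the subset.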

As a result, a complex nonlinear model is obtained even if the original
model is linear.

The optimization problem stated for this model is to maximize
the following intertemporal welfare function

$$ J_T = \int_0^T [(1 - \mu)V - \mu W]\, e^{-\rho t}\, dt, \tag{6} $$

where $\rho > 0$ is the discount factor, $T < \infty$, $\mu \in [0, 1]$, $V = pc$,
is the vector of investments in k^, Q^d is the current innovative cost, 
and is a penalty function for the violation of ’’soft” sustainability 
constraints, 

r G 

Formally (6) is represented as 

j = [(1 - ,j)V - pW {t, k, k^, r)-u‘‘- J{a) = 0, J(T) = Jt{) 

The problem of interest is to find a time profile for the control vectors
$y$, $u$, $z$, $u^z$, $d$ and $u^d$ that maximizes (6) subject to the model relations,
the additional constraints on controls

$$ (y, u, z, u^z, d, u^d) \in \Omega(t, k, k^z, k^d, r, \theta), \tag{7} $$

“hard” sustainability constraints 

ik,r,k^)e^^{t), 

and given initial and terminal boundary conditions for $k$, $r$, $k^z$, $\theta$, and
$k^d$.

The complexity and nonlinearity of the resulting integrated model
do not allow any special optimization method to be applied to it directly.
However, if the original model can be treated analytically, perhaps
approximately, then this opportunity can be used via the following
multistep procedure.

Step 1. Disregarding the equation for $\theta$, the different model coefficients
are considered as arbitrary given functions of time.

Step 2. The optimal control problem thus defined is solved by the
special method mentioned above under a series of idealized
assumptions.

Step 3. The control trajectory obtained in Step 2 is locally
improved by means of the control program $\{d(t)\}$.

Step 4. Steps 1-3 are repeated until no improvement is possible.

Step 5. The idealized assumptions are removed, and the solution
is modified correspondingly by an iterative improvement
method (see Belyshev and Shevchuk, 1999).
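The control flow of Steps 1-4 can be sketched as a simple alternation loop. The Python code below is schematic only: the three callables stand in for the problem-specific parts (the idealized solver of Steps 1-2, the local improvement of Step 3, and the objective), the scalar toy problem replaces the regional model, and Step 5 is not covered. All names are hypothetical.

```python
from typing import Callable, Tuple


def multistep(solve_idealized: Callable[[float], float],
              improve_d: Callable[[float, float], float],
              objective: Callable[[float, float], float],
              d0: float, tol: float = 1e-10,
              max_iter: int = 100) -> Tuple[float, float, float]:
    """Alternate between the idealized solve and a local improvement of d."""
    d = d0
    controls = solve_idealized(d)                  # Steps 1-2: coefficients frozen
    best = objective(controls, d)
    for _ in range(max_iter):
        d_trial = improve_d(controls, d)           # Step 3: local change of d
        controls_trial = solve_idealized(d_trial)  # Steps 1-2 again
        value = objective(controls_trial, d_trial)
        if value <= best + tol:                    # Step 4: stop when nothing is gained
            break
        d, controls, best = d_trial, controls_trial, value
    return controls, d, best


# Toy stand-in: maximize J(u, d) = -(u - d)**2 - (d - 1)**2.
def toy_objective(u: float, d: float) -> float:
    return -(u - d) ** 2 - (d - 1.0) ** 2


def toy_solve(d: float) -> float:
    return d                                       # the best u for a frozen d


def toy_improve(u: float, d: float) -> float:
    gradient = 2.0 * (u - d) - 2.0 * (d - 1.0)     # dJ/dd with u held fixed
    return d + 0.25 * gradient


controls, d_opt, best = multistep(toy_solve, toy_improve, toy_objective, d0=0.0)
```

On the toy problem the iteration approaches $d = 1$ and stops as soon as a further pass no longer improves the objective by more than the tolerance.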

In Steps 1-2 we assume the following:

1) the controls $u$ and $z$ are unbounded; the equation for $k^z$ in (4) is
dropped, and the term $B^z u^z$ in (1) is taken to be zero;

2) the boundary conditions on $k$, $k^d$, and $r$ are fixed-point ones.

To carry out these steps, we rely on the extension principle (Gurman 1985,
1997), whose idea is to replace the original problem by an analogous one
that is free of some "difficult" constraints, so that its solution either
coincides exactly with or lies close to the solution of the original problem.
Specifically, we use the "singular relaxation" method, where the extensions
considered are so-called singular relaxations, i.e. extensions that, applied
to a class of unbounded control systems, preserve all their integral
characteristics. They allow generalized optimal solutions to be represented as
impulse modes in terms of regular ordinary differential equations.

2. DESCRIPTION OF THE SOFTWARE 
PACKAGE 

The following approaches to the corresponding software architecture
were proposed and implemented in part:

1 The level of elementary operations is raised in comparison with
standard packages;

2 An object-oriented design of the model representation is used. It is
important to give a researcher the chance to work effectively with such
huge sets of data and algorithms. This approach creates an internal
hierarchy in a model;

3 To store such complicated structures a special object database is
needed, which is capable of keeping different types of data;

4 The model dynamics (such as differential and difference equations
and various restrictions) are represented as source code in some
interpreted language. This allows the user to change all the model
equations and their parameters at runtime. It is also useful as a method to
keep this data in a common database;

5 The model dynamics source code is improved via pre-compilation
(supercompilation);

6 Elements of artificial intelligence (expert systems) are used in
multimethod procedures and when assembling the main program
out of modules;

7 Distributed and parallel computation are provided;

8 Common interfaces for working with different optimizing algorithms
designed by other developers, and the possibility of creating
optimization sequences, are developed.

Figure 1 The scheme of software architecture

One of our main objectives is to create a flexible software architecture
that supports different changes of models and algorithms and gives
the user tools to verify each step of a computational experiment.

A sample software package, based on the above principles, has been
created and used to investigate the above regional model.

Figure 1 represents the scheme of this software application. It con- 
sists of five main parts: (1) data management system, (2) optimizing 
algorithms library, (3) control module, (4) computation module, (5) rep- 
resentation module. 

1. The data management system contains a full description of the
active (procedures realizing model dependencies and differential
equations) and passive (static and dynamic scalars, vectors and matrices)
parts of the model. The model is considered as a whole object with an
internal hierarchy and specific methods, or "elemental operations", such as
"calculate the state at time t, starting from the initial state and control
program", "calculate the value of the functional at the time $T_F$", "determine
if the current path is locally optimal or not" and "proceed to the optimal
path in some neighborhood of the current state". Model procedures,
as noted above, are stored as source code in an interpreted
language. A JavaScript interpreter is used in the current program
version; therefore the data management system has mechanisms for script
writing and debugging.

Figure 2 Data management system
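The idea of keeping model equations as interpreted source text can be illustrated as follows. The package described here uses a JavaScript interpreter; the sketch below uses Python's built-in `compile` and `eval` instead, purely as an assumption for illustration, and the class name `ScriptedModel` and the sample equation are hypothetical.

```python
import math
from typing import Any


class ScriptedModel:
    """A model equation kept as source text and recompiled whenever it is edited."""

    def __init__(self, source: str) -> None:
        self.set_source(source)

    def set_source(self, source: str) -> None:
        self.source = source                        # the text is what would be stored
        self._code = compile(source, "<model>", "eval")

    def rhs(self, **variables: float) -> Any:
        names = {"exp": math.exp, "sqrt": math.sqrt}
        names.update(variables)
        # Builtins are emptied only to keep the namespace small;
        # this is not a security boundary.
        return eval(self._code, {"__builtins__": {}}, names)


model = ScriptedModel("u - delta * k")              # capital dynamics as text
before = model.rhs(u=2.0, delta=0.1, k=10.0)

model.set_source("u - delta * k - 0.5")             # the user edits the equation at runtime
after = model.rhs(u=2.0, delta=0.1, k=10.0)
```

The equation can be changed between two evaluations without rebuilding the host program, and its text can be kept in a database alongside the numerical data.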

2. The optimizing algorithms library contains a collection of
subprograms realizing algorithms for model optimization. Each
subprogram is a dynamic link library (DLL). Because the model is rather
intricate, it is useful to have multi-step optimizing procedures and a
set of optimizing algorithms. In this case it is necessary to have a
common interface for integrating algorithms with the system and with
each other. Such an interface was specified, so that one can construct
sequences of optimization, connecting procedures with each other. This
approach also lets one insert new algorithms into the library without
any changes in the main program. Thus the algorithms library is open
for insertion of new methods and algorithms.
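
The idea of a common interface that lets algorithms be chained into
optimization sequences can be sketched as follows. This is only a minimal
illustration in Java, not the interface of the package described here
(whose algorithms are DLLs): the names OptimizationStep,
OptimizationSequence, ScaleStep and ClampStep are hypothetical, and the
control program is reduced to a plain array of numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common interface: every algorithm maps a control program
// (here simply an array of control values) to an improved one.
interface OptimizationStep {
    double[] improve(double[] control);
}

// A sequence of steps is itself a step, so sequences can be nested and
// new algorithms can be added without touching the calling program.
class OptimizationSequence implements OptimizationStep {
    private final List<OptimizationStep> steps = new ArrayList<>();

    OptimizationSequence then(OptimizationStep step) {
        steps.add(step);
        return this;
    }

    @Override
    public double[] improve(double[] control) {
        double[] current = control.clone();
        for (OptimizationStep step : steps) {
            current = step.improve(current);
        }
        return current;
    }
}

// Toy "algorithm" used only to exercise the interface: scales the control.
class ScaleStep implements OptimizationStep {
    private final double factor;

    ScaleStep(double factor) { this.factor = factor; }

    @Override
    public double[] improve(double[] control) {
        double[] out = control.clone();
        for (int i = 0; i < out.length; i++) {
            out[i] *= factor;
        }
        return out;
    }
}

// Toy "algorithm": projects the control onto the admissible interval [lo, hi].
class ClampStep implements OptimizationStep {
    private final double lo;
    private final double hi;

    ClampStep(double lo, double hi) { this.lo = lo; this.hi = hi; }

    @Override
    public double[] improve(double[] control) {
        double[] out = control.clone();
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(lo, Math.min(hi, out[i]));
        }
        return out;
    }
}
```

A sequence is then built by connecting steps, for example
new OptimizationSequence().then(new ScaleStep(2.0)).then(new ClampStep(0.0, 1.0)),
and is used exactly like a single algorithm.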

3. The control module handles the cooperation between the user and
the system. It should give the user a friendly and intuitive interface,
which partly removes the problem of the end user misunderstanding the
model description and the interaction formalism. Because of the
object-oriented design of the model representation, control tools are
divided into groups connected with specific objects.
Figure 3 The optimizing algorithms library

It is very important that the control module have tools to backtrack
all the dependencies and overpatching in the model under investigation.
The user can interrupt the computational process and trace necessary
parameters, or print intermediate data and take any corrective actions
on the model during the computation. These system properties give the
researcher wide opportunities in his work.

4. The computational module is the most important for the final
procedure. The argument for this module is prepared by the model, which,
as the above example shows, may have a very complicated data structure
with many variables of different nature divided into different groups.

The calculations work under the following scheme:

$$x(t+1) = f(x(t), u(t), t),$$

where $x(t)$ is the state vector, $u(t)$ is the control vector at the time $t$,
and $f$ is the model operator.
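
The recurrence above can be sketched in a few lines of Java. The interface
ModelOperator and the class Trajectory are hypothetical names introduced
only for this illustration; in the package itself the model operator is
supplied as an interpreted script.

```java
// Hypothetical functional interface for the model operator f(x, u, t).
interface ModelOperator {
    double[] apply(double[] x, double[] u, int t);
}

class Trajectory {
    // Returns the states x(0), ..., x(T) for a given initial state x(0)
    // and a control program u(0), ..., u(T-1), using x(t+1) = f(x(t), u(t), t).
    static double[][] simulate(ModelOperator f, double[] x0, double[][] u) {
        double[][] x = new double[u.length + 1][];
        x[0] = x0.clone();
        for (int t = 0; t < u.length; t++) {
            x[t + 1] = f.apply(x[t], u[t], t);
        }
        return x;
    }
}
```

With the toy operator f(x, u, t) = x + u, the initial state 1 and the
controls 1, 2, 3, the call returns the four states 1, 2, 4, 7.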

The user chooses an optimization algorithm and then sets the algorithm
parameters for the current situation. As parameters of an algorithm
we consider, for example, settings for exponential penalty functions such
as $F = \sum e^{(x - x_{\max})}$. It is important to set correct and effective
coefficients for penalty expressions, because they are responsible for
observing the model restrictions. To solve this problem effectively it is
planned to develop procedures supported by artificial intelligence tools,
such as expert systems. They will help the user to set up the necessary
parameters on the basis of previous experience.
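
Why the coefficients matter can be seen from a small sketch. It assumes a
penalty of the form sum of exp(c * (x_i - xMax_i)) with a single steepness
coefficient c; this form, the coefficient c and the class name
ExponentialPenalty are assumptions made for illustration and are not taken
from the package.

```java
class ExponentialPenalty {
    // Assumed form: sum over components of exp(c * (x[i] - xMax[i])).
    // Each term is below 1 while x[i] < xMax[i], equals 1 on the bound,
    // and grows exponentially once the bound is violated.
    static double value(double[] x, double[] xMax, double c) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.exp(c * (x[i] - xMax[i]));
        }
        return sum;
    }
}
```

A small c makes the penalty too weak to enforce the restriction, while a
very large c makes the functional steep and numerically awkward near the
bound, which is the tuning problem the text refers to.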

In the future, a precompilation subprogram will be added, which
improves the source code of the model dynamics before use. Techniques
for cleaning up program code and decreasing computation time are
described in (Turchin, 1985).

5. The representation module is very important for research
work. It helps to trace all changes and dependencies in the model and
to get the results. The software has to give the user a mechanism to get
any information about the model at any desired time, and to present it
in a convenient view. For more comfortable work and higher-quality
presentation it uses standard Windows libraries and widespread software
such as Microsoft Office. Optionally it can print data and plot graphs
directly on Microsoft Excel sheets. Some results of this module are used
in this paper as graphs and tables.

Figure 5 The representation module
3. EXAMPLE: APPLICATION TO PERESLAVL REGION

Pereslavl region is a part of the Yaroslavl region of Russia. Its center,
the town of Pereslavl-Zalessky, is situated on the famous Golden Ring of
Russia, halfway from Moscow to the city of Yaroslavl, on the shore of Lake
Plestcheevo (water area of 51 sq. kilometers), which has the status of a
natural and historical memorial. Its territory is 3300 sq. kilometers, and
the present population is about 71 thousand.

The leading branches of the economy are the photochemical, textile and
food industries, and agriculture. There is a great potential for tourism
and recreation, given the comparatively good ecological conditions
and numerous beautiful landscapes, historical relics and monuments.
The model described in the Introduction was specified for this region
according to the following formulas:
r(fc)=7fc, 

prf = 

<r r < . < r < 

^ • 'mm — ' — 'max? ^ * 'mm — ' — 'max? 

W = Ydi mini if r ^ = 0, if r € 

n : < Ldem, 

L^^^ = lVy + Uz + l% 

qUmax + q^uLax + 9‘^«max = “X- 

Here and are the total labor demand, and specific de- 

mand row-vectors, corresponds to the admissible unemployment, 
and is the total labor resource available. The last relation means 
distribution of the total investment available in the region between 
different types of activity according the weight coefficient vectors q, q^, 
and q^. It is specified quantitatively in scenarios. 

The corresponding data are listed in Table 1.

The following basic scenarios of the region's development were
considered.

The first scenario (I) is a continuation of the trends of the past basic
period 1995-1998 up to 2014 without optimization, and without active
control of the ecological, social and innovative components (z = 0 and
d = 0).

The other scenarios were considered over the period 1999-2014 (15 years).
They include approximate global optimization accompanied by iterative
improvement.

The second scenario (II) is the same as (I) but with approximate
optimization without active innovative control, to reveal the level of
structural efficiency.







The third scenario (III) presumes that the innovative control is activated
under different restrictions on investments (several versions). The fastest
attainment of structural efficiency is taken as the initial approximation
for the innovative control program.



Table 1 Pereslavl region: basic data (the state variables’ values refer to the end of
1995).
The fourth scenario (IV) accounts for the dependence of model parameters
on r, including the positive dependence of innovative matrices on culture
and education. Optimization is performed by the universal improvement
algorithm.

The fifth scenario (V) estimates the effect of price changes due to
changes in the regional compatibility.
Figure 6 Computational results for the first scenario




Figure 7 Computational result for the second scenario 



Some of the results of model calculations for the first three scenarios
are shown in Figures 6-8. The initial conditions for them were obtained
via calculations according to scenario (I) for 1995-1998. On the whole
they are confirmed by preliminary statistical data for 1998, which are
insufficient to be recalculated strictly into the required set of data.

As is seen from the results for scenario (II), only the second aggregated
branch of the economy is productive and works at full capacity and with
maximal investment in the optimal mode; the two other branches are
nonproductive; they should work at the minimal admissible (from
employment considerations) level and without investment. This means that
at present the regional system is structurally ineffective.







To overcome this drawback within an acceptable time horizon, an active
innovative policy is required. As represented by scenario (III), it
consists of a radical redistribution of future investment in favor of the
innovative sector, whose functioning is aimed first of all at improving
the structural efficiency parameters, that is, elements of matrices A, C,
A^. As a result, after 5 years all three aggregated economy branches
become productive, that is, structural efficiency is attained. After that
the investments are redistributed gradually in favor of the economic
sector and capacities supporting the improvement of the most negative
ecological and social characteristics.

4. CONCLUSION 

The model presented and the numerical experiment show the peculiarities
of research work with big complicated objects such as a region. The
standard approaches and methods of investigation are not effective for
such objects. The problem is not only the computational complexity
but also the complicated object structure, the large quantity of
interacting components and the nonlinear dependencies between them. These
features highly complicate the work of a researcher, and the software
proposed has to give him an object representation that simplifies its
understanding and helps to work with it.

Another feature is the experimental character of computations. Thus
the software architecture should have mechanisms to trace and change
almost all parameters of the object under investigation at any time;
moreover, it has to be open to the addition and installation of new
processing units.

The fulfillment of these software requirements allows one to achieve
considerable results in investigation. Thus the software architecture
proposed can find many applications in different areas.
DISCUSSION

Speaker: Dimitry Belyshev 

Margaret Wright : Who are the users of this system? What is their 
level of expertise? 

Dimitry Belyshev : The system can be used for more than just
regional modeling. We can predict the development of a business and take
as the time period not a year, but a month or a week. In this case the
system can be used by businessmen for estimation of management strategy.

Ivor Philips : How do you verify the correctness of the mathematical
model since it is predicting events that won’t be verifiable until 2010?

Dimitry Belyshev : The correctness of the model is a very important
part of the investigations. Obviously, we don’t know what will happen in
2010, but we have full information about our history of development. So
to verify the model it is enough to begin with historical data and check
the results against history. We can also “turn the time back”
and use the model to “predict the past” and observe the discrepancy
between our artificial history and real facts.

Morven Gentleman : You suggest that this model can be validated
by using the historical record. However, your forward study compared 
three scenarios: Current, with a new industry, and with optimization. 
Since historically what happened is fixed, how could you compare those 
three scenarios? 

Dimitry Belyshev : Yes, this method can’t be used if we make some
changes in the model. However, we can validate the basic characteristics
and mechanisms of the model. It could be enough, because most of our
optimizations relate to assignment of money between different branches
of the economy and do not change its parameters. If the economics part
works properly in the basic scenario, it will be the same in others.

Bo Einarsson : Have you looked at the outcome of the prognoses if
you did not include the social and ecological aspects?

Dimitry Belyshev : Yes. The regional profits became close to the real 
situation. They are not very big, but not negative. 

Bo Einarsson : When you made the backwards prognoses, did you 
then consider the discontinuity in your country? 

Dimitry Belyshev : No. We only try to model the economical pro- 
cesses, but we can’t predict our politics. The historical investigations 
can be made only in a stable period of economy. 
Abstract Java is quickly becoming the most popular platform for distributed
computing. However, its performance is still subject to concerns in
comparison to other programming languages such as C and Fortran. As a
consequence, programmers of high-performance applications are usually 
reluctant to embrace Java as an alternative language in their work. This 
article introduces the Java-to-C Interface (JCI) tool which generates au- 
tomatically the wrapper code interfacing existing scientific libraries to 
Java. Thus, facilitating rapid development and software reuse, the JCI 
tool provides application programmers with immediate accessibility to 
existing scientific libraries from Java. While beneficial to the software 
developer, the additional advantages of mixed-language programming 
in terms of application performance are addressed in detail within the 
context of this work. We also present analysis and comparisons of eval- 
uation results for mixed-language codes in Java and C/Fortran on a 
high-performance distributed memory computer (IBM SP-2). The NAS 
Embarrassingly Parallel and Integer Sort benchmarks as well as the Ma- 
trix Multiplication kernel from the PARKBENCH suite were selected 
for our experiments. The evaluation results demonstrate the feasibility 
and efficiency of our mixed-language programming methodology with 
Java. 
1. INTRODUCTION

One of the problems facing the Java programming language and its
acceptance for scientific computing is performance. In general, the
fundamental trade-off between portability and performance is very well
known to the high-performance computing community. The Java language
designers placed an emphasis on portability (and in particular, mobility)
of code over performance. This is one of the main reasons why
making Java programs run fast is not an easy task.

A closer inspection shows that the Java platform has several built-in 
mechanisms which allow the parallelism inherent in scientific programs 
to be exploited. Threads and concurrency constructs are well-suited to 
shared memory computers, but not large-scale distributed memory
machines. Although sockets and the Remote Method Invocation (RMI)
interface allow network programming, they are too low-level to be
suitable for the Single-Program-Multiple-Data (SPMD) parallel
programming model. Therefore, codes based on them would potentially
underperform platform-specific high-performance implementations of
standard scientific and communication libraries.

Nevertheless, as a programming language, Java has the core qualities 
needed for writing high-performance applications. With the maturing 
of compilation technology, such applications written in Java are starting 
to appear. Fortunately, rapid progress is being made in this area by 
developing static Java compilers, such as the IBM High-Performance 
Compiler for Java (HPCJ), which generates optimized native code for 
the RS6000 architecture [11]. Since the Java language is relatively new, 
however, it lacks the extensive scientific libraries of other languages such 
as Fortran and C. This is one of the major obstacles towards efficient
and user-friendly computationally intensive programming in Java.

Standard libraries often used for high-performance scientific comput- 
ing include the Message Passing Interface (MPI), and the Scalable Lin- 
ear Algebra PACKage (ScaLAPACK). Providing access to such libraries 
seems imperative if Java is to achieve the success of Fortran and C in sci- 
entific programming. Access to standard libraries is essential not only for 
performance reasons, but also for software engineering considerations. If 
available, it would allow the wealth of existing Fortran and C code to 
be reused at virtually no extra cost when writing new applications in 
Java. In order to overcome these problems, we have applied our JCI
code generating tool to create Java bindings for various legacy libraries
[7].

In this article we first describe the design principles of the JCI tool. 
We also introduce our methodology for mixed-language software
development with Java and demonstrate the viability of our approach on a
number of performance evaluation experiments.

2. THE JCI TOOL 

At first sight it appears that the binding of a native library to Java
should not be a problem, as Java implementations support the Java
Native Interface (JNI) via which C functions or Fortran subroutines can
be called [9]. There are some hidden problems, however. Complications
stem from the fact that Java data formats are in general different from
those of other languages like C, C++, Fortran, etc. This obviously
requires data conversion of both arguments and results in mixed-language
applications. Such conversion is a natural part of the native code if
both parts of a mixed-language piece of software are to be written from
scratch. For legacy codes, however, an additional interface layer called
a binding or wrapper must be created which performs data conversion and
other auxiliary functions if necessary.

In principle, the binding of a native library to Java amounts to either
dynamically linking the library to the Java Virtual Machine (JVM), or
linking the library to the object code produced by a static Java compiler.
Binding a legacy library to Java may also be accompanied by portability
problems as the JNI specification is still not fully supported in
different Java implementations. Thus, in order to maintain the portability
of the binding one may have to cater for a variety of native interfaces. A
large legacy library like MPI, for example, can have over a hundred
exported functions. Therefore, the JCI tool, which automatically generates
the additional interface layer, plays a central role in our mixed-language
programming methodology. In order to call a C function from Java,
the JCI tool has to supply for each formal argument of the C function
a corresponding actual argument in Java. Unfortunately, the disparity
between data layout in the two languages is large enough to rule out a
direct mapping in general. For instance, one has to take into account
that:



■ primitive types in C may be of varying sizes, different from the
standard Java sizes;

■ there is no direct analog to C pointers in Java;

■ multi-dimensional arrays in C have no direct counterpart in Java;

■ C structures can be emulated by Java objects, but the layout of
fields of an object may be different from the layout of a C structure;

■ C functions passed as arguments have no direct counterpart in
Java.

Table 1 Mapping of compound C types into Java types

C type           Java type         Comment
char *           ObjectOfChar      - if at top level and not
                                     the type of a function;
                 String            - otherwise.
struct name *    class name
void *           Object
c_type *         ObjectOfj_type
char []          String
c_type []        j_type []
struct name      class name

Therefore, one has to define a specific mapping which is then
implemented by the JCI tool. Table 1 shows the scheme currently used to map
C types onto Java types. Primitive types are not listed in this table
because they are to be found in the documentation of each JVM’s native
interface. C pointers are represented in a type-safe way by a family of
Java classes generated by JCI. Each such class is named ObjectOfj_type,
and contains a field val of type j_type. Pointer objects can be created and
initialized by Java constructors, or by the overloaded function JCI.ptr.
They can be dereferenced by accessing the val field. In general, the
defined mapping is not unique; on the contrary, there are a number
of different mappings to choose from. The selection of an appropriate
mapping represents an important trade-off between the extent of the
performance overhead introduced by the binding on the one hand, and
the ease of use of the programming interface from Java on the other.
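
The pointer classes described above can be pictured with the following
sketch. It is a hand-written stand-in, not output of the JCI tool: the
classes ObjectOfInt and ObjectOfDouble follow the naming scheme in the
text, while Ptr (playing the role of the overloaded JCI.ptr) and
PointerDemo are hypothetical names, and the "native" routine is ordinary
Java so that the example is self-contained.

```java
// Illustrative stand-ins for generated pointer classes: one class per
// Java type, each holding the pointed-to value in a field named val.
class ObjectOfInt {
    int val;
    ObjectOfInt(int val) { this.val = val; }
}

class ObjectOfDouble {
    double val;
    ObjectOfDouble(double val) { this.val = val; }
}

// Hypothetical helper mirroring an overloaded ptr function: the overload
// is chosen by the compiler from the argument type, which is what makes
// the pointer representation type-safe.
class Ptr {
    static ObjectOfInt ptr(int v) { return new ObjectOfInt(v); }
    static ObjectOfDouble ptr(double v) { return new ObjectOfDouble(v); }
}

class PointerDemo {
    // Plays the role of a library routine with an "int *" output argument:
    // the result is written through the pointer object.
    static void countPositive(double[] data, ObjectOfInt result) {
        int n = 0;
        for (double d : data) {
            if (d > 0) {
                n++;
            }
        }
        result.val = n;
    }
}
```

The caller creates the pointer object, passes it to the routine, and
dereferences it afterwards by reading the val field.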

A block diagram of JCI is shown in Figure 1. The tool takes as input 
a header file containing the C function prototypes of the native library. 
It outputs a number of files comprising the additional interface: a file 
of C stub-functions, files of Java class and native method declarations, 
and shell scripts for compiling and linking the binding. The JCI tool 
generates a C stub-function and a Java native method declaration for 
each exported function of the native library. Every C stub-function 
takes arguments whose types correspond directly to those of the Java 
native method, and converts the arguments into the form expected by 
the C library function. 
Figure 1 JCI block diagram



Thanks to the JCI tool our bindings are easily adaptable to various 
platforms. As we mentioned already, different Java native interfaces 
exist, and thus separate code may have to be generated for binding a 
given legacy library to different Java implementations. We have tried to 
limit the dependence of JCI’s output on the native interface version to a 
set of macro definitions describing the particular native interface. Thus, 
it may be possible to re-bind a library to a new Java platform simply by 
providing the appropriate macros. 

The tool also provides a good deal of flexibility for generating Java
wrappers to native libraries. For example, by using different library
header files as input, one can create bindings for multiple versions of a
library, such as MPI-1.1, MPI-1.2, MPI-2.0. Furthermore, JCI can be
used to generate Java bindings for libraries written in languages other
than C, provided that the library can be linked to C programs, and
prototypes for the library functions are given in C. This is how we have
created Java bindings for the ScaLAPACK constituent libraries written
in Fortran-77: BLAS Level 1-3, PB-BLAS, LAPACK, and ScaLAPACK
itself [7]. The C prototypes for the Fortran library functions have been
inferred following the methodology adopted in the Fortran-to-C
translator [5].

Table 2 Legacy libraries bound to Java

                                       Size of Java binding
library      written in   functions    C lines    Java lines
MPI          C                  125       4434           439
BLACS        C                   76       5702           489
BLAS         F77                 21       2095           169
PBLAS        C                   22       2567           127
PB-BLAS      F77                 30       4973           241
LAPACK       F77                 14        765            65
ScaLAPACK    F77                 38       5373           293

Table 2 gives some idea of the sizes of JCI-generated bindings for in- 
dividual libraries. In addition, there are some 2280 lines of Java class 
declarations produced by JCI which are common to all cases. The au- 
tomatically generated bindings are fairly large in size because they are 
meant to be portable, and to support different data formats. On a par- 
ticular hardware platform and JNI implementation, much of the binding 
code may be eliminated during the preprocessing phase of its compila- 
tion. 

Even though JCI does a lot to smooth out the interface between Java 
and legacy codes, calling native library functions may not be as straight- 
forward and elegant as calling Java functions. Some peculiarities and 
difficulties encountered while writing Java programs which access native 
libraries are listed below. 

Pointers/addresses. A pointer to a value of type j_type is represented
in JCI-generated bindings as a class with a single field val of type
j_type. Pointer objects can be created and initialized by Java
constructors, or by the overloaded function JCI.ptr. They can be
dereferenced by accessing the val field. In addition, there is a
specific peculiarity when accessing a Fortran native library because
arguments in Fortran are always passed by reference. Therefore, all
scalar arguments to a Fortran native function must be enclosed in
pointer objects, regardless of whether they are intended for input
or output of values.

Array offsets. In both C and Fortran, one can pass the address of an
array element as an actual argument to a function or subroutine.
This is not possible in Java. Consequently, a Java program cannot
pass part of an array starting at a certain offset to a (native
library) function. One way round this restriction is to add one
integer “offset” argument for each array argument of a function [2].
The JCI-generated wrapper code supports a more elegant solution
as well, which does not involve extra arguments to native library
functions. The elements of an array arr of any type starting at
offset i can be passed to a native library function by

    JCI.section(arr, i)

where JCI.section is an overloaded method whose definition is
generated by JCI. For example, passing an array section to a native
library function can be done by

    blas.idamax(JCI.ptr(n-k), JCI.section(col_k, k), one)

The array col_k starting at offset k is passed to the BLAS function
idamax. Type safety with JCI.section is guaranteed, because
the compiler will check if the array has the required type. This
example also illustrates one unfortunate consequence of accessing
a Fortran function as discussed above: all scalars must be passed
by reference (i.e. be wrapped in objects, for example by JCI.ptr).
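
How an array section can stand in for "address of an array element" may be
sketched as below. This is not the code generated by JCI: DoubleSection and
Sections are hypothetical names, the section is assumed to be a simple
(array, offset) pair that copies nothing, and idamax is a pure-Java
stand-in with unit stride and a plain int length instead of pointer
objects, so that the example runs without a native library.

```java
// Hypothetical view of "array plus starting offset"; no elements are copied.
class DoubleSection {
    final double[] array;
    final int offset;

    DoubleSection(double[] array, int offset) {
        if (offset < 0 || offset > array.length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        this.array = array;
        this.offset = offset;
    }
}

class Sections {
    // Counterpart of an overloaded section method for double arrays;
    // passing an array of another type would be rejected by the compiler.
    static DoubleSection section(double[] arr, int i) {
        return new DoubleSection(arr, i);
    }

    // Stand-in for BLAS idamax with unit stride: returns the 1-based
    // (Fortran convention) position, within the section, of the first
    // element of largest absolute value among n elements; 0 if n < 1.
    static int idamax(int n, DoubleSection x) {
        if (n < 1) {
            return 0;
        }
        int best = 0;
        double max = Math.abs(x.array[x.offset]);
        for (int j = 1; j < n; j++) {
            double v = Math.abs(x.array[x.offset + j]);
            if (v > max) {
                max = v;
                best = j;
            }
        }
        return best + 1;
    }
}
```

For the array {9, 1, -7, 3}, searching three elements of the section
starting at offset 1 returns 2, that is, the second element of the section
(the value -7).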

Multi-dimensional arrays. Many scientific library functions take as
arguments multi-dimensional arrays such as matrices. The JCI
tool supports multi-dimensional arrays, but a run-time overhead is
incurred because such arrays must always be relocated in memory
in order to be made contiguous before being supplied to a native
function. When large data arrays are involved the inefficiency can
be significant. In order to avoid this problem to some extent,
in our ScaLAPACK library bindings we have chosen to represent
matrices in Java as one-dimensional arrays. On the other hand,
in the Java binding for MPI [12] multi-dimensional arrays are left
intact without significant inefficiency. Large arrays used as data
buffers can have their layout described by an MPI derived data
type, and the Java binding performs no conversion for them. Other
multi-dimensional arrays used in MPI as descriptors are relatively
small and therefore not important from a performance point of view.
Array indexing/layout. This problem is specific to native libraries
written in Fortran, where arrays are normally indexed starting
from 1, while in Java as in C indices start from 0. Java programs
calling Fortran native functions that receive or return an array
index must be aware of the difference. Another point to bear in
mind when accessing a Fortran library is the inverse order of array
layout in comparison with C.
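
The last two points, matrices kept as one-dimensional arrays and the
Fortran indexing and layout conventions, can be illustrated together. The
class FortranLayout and its method names are hypothetical and the matrix is
assumed to be rectangular; the sketch only shows the bookkeeping a Java
caller has to do, not code produced by JCI.

```java
class FortranLayout {
    // Copies a rows x cols Java matrix into one contiguous array in
    // column-major order, the layout a Fortran routine expects.
    static double[] toColumnMajor(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        double[] flat = new double[rows * cols];
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < rows; i++) {
                flat[j * rows + i] = a[i][j];
            }
        }
        return flat;
    }

    // Position of element (i, j), both 0-based, in the column-major array.
    static int index(int i, int j, int rows) {
        return j * rows + i;
    }

    // Converts a 1-based index returned by a Fortran routine
    // into a 0-based Java index.
    static int toJavaIndex(int fortranIndex) {
        return fortranIndex - 1;
    }
}
```

For the matrix {{1, 2, 3}, {4, 5, 6}} the flattened array is
{1, 4, 2, 5, 3, 6}. Keeping the matrix in this form from the start avoids
the copy on every native call, which is the reason given above for using
one-dimensional arrays in the ScaLAPACK bindings.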

3. EVALUATION RESULTS 

In this section we present performance analysis and comparisons of 
evaluation results for both Java and C/Fortran on a high-performance 
distributed memory computer (IBM SP-2). The NAS parallel Embar- 
rassingly Parallel (EP) and the Integer Sort (IS) benchmarks were used 
initially in our performance experiments. The EP kernel provides an 
estimate of the upper achievable limits for floating point performance, 
but requires minimal communications. The IS routine evaluates integer 
operations and bi-directional communications when the sorted keys are 
exchanged between nodes. The NAS version of IS is written in C, while 
the EP code is in Fortran. The NAS parallel benchmarks methodology 
specifies several problem sizes called “classes” in order to ensure
comparative measurements across different platforms and environments. In
our study, we present evaluation results for class B (2^30 data points) of
the EP kernel and class A (2^23 data points) for the IS benchmark.

The JVM and the Java compiler used on the IBM SP-2 machine were 
part of the JDK for AIX. The execution environment consisted of IBM’s 
Parallel Operating Environment (POE), which supports the loading and 
execution of parallel processes across the nodes of the IBM SP-2. The 
machine is built of thin nodes with Power-2 Super Chip (P2SC) pro- 
cessors and 256 Mbytes of memory on each node. The communication 
subsystem of the SP-2 features a high-performance switch which was 
used throughout the experiments. The message-passing library we have 
used with Java is the Local Area Multiprocessor (LAM) implementation 
of MPI from the Ohio Supercomputer Center [3]. Performance measure- 
ments for the corresponding Fortran or C code under both LAM and 
IBM’s native MPI implementation are also given for comparison. 

The evaluation results for the EP kernel (Figure 2) show good
scalability for up to 128 nodes on the SP-2. A substantial difference,
however, is that benchmarks using LAM MPI for message-passing
run approximately 2.5 times slower in Java than their corresponding
Fortran counterparts. This is not a surprise, but shows the performance
penalty that one should expect from a direct port of computationally
intensive code to Java. Of course, this is not the best mixed-language
performance one can obtain, as demonstrated by our further experiments.

Figure 2 Execution times for the NPB EP kernel (class B) on the IBM SP-2

After the initial period when the first versions of the Java platform
were built for portability, Java compiler technology has now entered
a second phase where the new versions are also targeting higher
performance. For example, just-in-time (JIT) compilers have dramatically
improved their efficiency, and are now challenging mature C++
compilers. Furthermore, to gain even faster execution times, the
developers of HPCJ have adopted the static compilation approach [11]. Their
compiler, which generates native code for the RS6000 architecture, was also
used in this evaluation in order to compare the conventional native
execution with the interpreted execution provided by JVMs.

Performance evaluation experiments with both the original C and the
Java versions of the IS kernel were carried out on the IBM SP-2 machine.
The results obtained are shown in Figure 3. When using the JVM for
AIX for interpreted execution with the JIT compiler enabled, the Java
IS benchmark is around two times slower than the original C code. In
order to gain a more detailed insight and to ensure fair comparisons,
we have run the C code with both the native IBM and LAM MPI
implementations. As expected, the LAM-based experiment is slower but
provides a basis for comparison with the Java version of the IS
kernel, which also uses LAM for message-passing. The performance of this
latter code is very impressive when compiled statically with HPCJ. The
timing results almost overlap with those delivered by the C version and
provide evidence that Java can be used successfully in high-performance
computing.

Figure 3 Execution time for the NPB IS kernel (class A) on the IBM SP-2

Further experiments on the IBM SP-2 were conducted with a Java
translation of the Matrix Multiplication (MATMUL) benchmark from
the PARKBENCH suite [13]. The original benchmark is written in
Fortran-77 and performs dense matrix multiplication in parallel. It
accesses the BLAS, BLACS and LAPACK libraries included in the
PARKBENCH 2.1.1 distribution. MPI is also used but indirectly through the
BLACS native library. The default problem size (N) is N = 1000.

Changing the balance between the two parts of a given code written 
in both Java and C or Fortran changes also the performance penalty 
for using Java. For example, within the MATMUL benchmark most 
of the performance-sensitive calculations are carried out by the native 
library routines rather than by the Java part of the program. Therefore, 
the Java MATMUL execution times are only 5-10% longer than the 
measurement results obtained with the original Fortran code as shown 
in Figure 4. In both experiments for the above comparison we have 
used LAM as a message-passing environment. Results obtained with the 
original kernel and the native IBM MPI are also given for completeness. 
Figure 4 Execution time for the PARKBENCH MATMUL kernel on the IBM SP-2



The observations of the above experiment clearly demonstrate another 
dimension of flexibility for our mixed-language programming methodol- 
ogy. In this case, excellent performance results can be achieved even 
without using a static Java compiler like HPCJ. Instead, the relatively 
small (5-10%) performance penalty is incurred by the interpreted 
execution using a standard JVM with the JIT compiler enabled. Such 
small overhead can be achieved by keeping the calculation-intensive code 
within the native library. 
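
The dependence of the penalty on the balance between the two parts can be made concrete with a simple additive model. The sketch below is illustrative only: the model, the class name PenaltyModel and the sample fractions and slowdown factors are assumptions, not measurements from these experiments.

```java
// Illustrative model only; the numbers used in main are hypothetical.
final class PenaltyModel {

    public static void main(String[] args) {
        // Most of the time in the native library: the Java penalty stays small.
        System.out.println(relativeTime(0.95, 2.0));
        // Half of the time in Java code: the same slowdown costs far more.
        System.out.println(relativeTime(0.50, 2.0));
    }

    /**
     * Run time of the mixed-language program divided by the run time of
     * the original code, assuming the native part runs at the same speed
     * in both and the Java part is slower by a constant factor.
     *
     * @param nativeFraction share of the original run time spent in
     *                       native library routines, between 0 and 1
     * @param javaSlowdown   assumed slowdown of the Java part, at least 1
     */
    static double relativeTime(double nativeFraction, double javaSlowdown) {
        if (nativeFraction < 0.0 || nativeFraction > 1.0 || javaSlowdown < 1.0) {
            throw new IllegalArgumentException("fraction must be in [0,1], slowdown >= 1");
        }
        return nativeFraction + (1.0 - nativeFraction) * javaSlowdown;
    }
}
```

Under these assumed inputs the first case gives a ratio of about 1.05 and the second about 1.5, which is the qualitative effect described above: the larger the share of work done inside the native library, the smaller the cost of writing the remainder in Java.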

Thus, one can apply the JCI tool to wrap up the time-consuming part 
of a software system as a native library and implement the rest of it in 
Java. In such cases, the inefficient interpreted execution of the JVM is 
only used for a front-end Java code that provides coordination functions 
and interactive interfaces. Clearly, our mixed-language programming 
methodology does not impose any restrictions or requirements regarding 
the implementation level of the wrapper code. This gives the flexibility 
to select the most appropriate and efficient balance of different programming 
languages within each individual software development project. 
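
The division of labour described above can be sketched in a few lines of Java. This is a hand-written illustration, not output of the JCI tool: the interface MatMulKernel, the classes around it and the library name "matmulwrap" are hypothetical, the C side of the native method is not shown, and the pure-Java fallback is an addition made here so that the sketch runs even where no native library is installed.

```java
import java.util.Arrays;

// Front-end code: coordination only, no heavy computation.
final class FrontEndDemo {
    public static void main(String[] args) {
        // "matmulwrap" is a hypothetical wrapper library name.
        MatMulKernel kernel = KernelSelector.select("matmulwrap");
        double[] c = kernel.multiply(new double[] {1, 2, 3, 4},
                                     new double[] {5, 6, 7, 8}, 2);
        System.out.println(Arrays.toString(c));
    }
}

// Hypothetical interface for the compute-intensive operation.
interface MatMulKernel {
    /** Multiplies two n x n row-major matrices and returns the product. */
    double[] multiply(double[] a, double[] b, int n);
}

// Binding to a native routine through JNI. The matching C function is
// assumed to exist in the wrapper library and is not part of this sketch.
final class NativeMatMul implements MatMulKernel {
    @Override
    public native double[] multiply(double[] a, double[] b, int n);
}

// Pure-Java fallback (an assumption of this sketch, not of the paper).
final class JavaMatMul implements MatMulKernel {
    @Override
    public double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }
}

// Chooses the native binding when the library loads, else the fallback.
final class KernelSelector {
    static MatMulKernel select(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return new NativeMatMul();
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return new JavaMatMul();
        }
    }
}
```

The front end depends only on the interface, so the balance between Java and native code can be shifted per project without touching the coordinating code.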

4. DISCUSSION AND RELATED WORK 

Many research groups and vendors are pursuing research to improve 
Java’s performance which would enable more scientific and engineering 
applications to be solved on Java platforms. The need for access to 
legacy libraries is one of the burning problems in this area. Several 
approaches can be taken in order to make the libraries available from 
Java: 



■ Rewriting by hand existing libraries in Java. Considering the size 
of the available codes and the number of years that were invested 
in their development, rewriting the libraries would require an enor- 
mous amount of manual work [2]. 

■ Automatically translating Fortran or C libraries into Java. We 
are aware of two groups that have been working in this area - 
University of Tennessee [4] and Syracuse University [6]. This ap- 
proach offers an important long-term perspective as it preserves 
Java portability, while achieving high performance in this case 
would obviously be more difficult. 

■ Manually or automatically creating a Java wrapper for an existing 
native Fortran or C library. Obviously, by binding legacy libraries, 
Java programs can gain in performance on all those hardware plat- 
forms where the libraries are efficiently implemented. The price to 
be paid for this clear advantage, however, is the use of native code 
which breaks the Java security model and does not allow work with 
applets. 

The automatic binding, which we are primarily interested in, has the 
obvious advantage of involving the least amount of work, thus reduc- 
ing dramatically the time for development. Moreover, it guarantees 
the best performance results, at least in the short term, because the 
well-established scientific libraries usually have multiple implementa- 
tions carefully tuned for maximum performance on different hardware 
platforms. Last but not least, by applying the software re-use tenet, 
each native legacy library can be linked to Java without any need for 
re-coding or translating its implementation. 

While automatic binding is certainly convenient, sometimes the data 
conversion may impose a bigger performance penalty. As described in 
section 2 we have addressed several issues potentially contributing to a 
bigger time overhead of our mixed-language programming approach. As 
a result of that, our experiments on IBM SP-2 machines have shown a 
negligible amount of time spent in the binding itself during the execution 
of Java programs. 

One of the primary goals of our approach has been to gain faster 
execution times by using Java and legacy scientific libraries written in 
C or Fortran without sacrificing performance from the available highly 
optimized native code. The use of the JCI tool clearly extends Java’s 
usefulness and provides a rapid solution to the mixed-language interfacing 
problem, but the JNI-wrapping techniques introduce certain limitations 
on application portability and mobility. One possible solution to 
this problem can be achieved by extending the functionality within the 
boundaries of a metacomputing environment [8]. 

5. CONCLUSIONS 

This article presents a general approach to combine Java and exist- 
ing code written in Fortran and/or C into mixed-language applications 
where Java serves as a front-end component for legacy native libraries. 
We also show that with these existing performance-tuned libraries al- 
ready available on different platforms and the wrapper interfaces gen- 
erated by the JCI tool, one can build different kinds of mixed-language 
software systems for high-performance Java computing in a flexible and 
elegant way. The JCI tool for automatic creation of interfaces to such 
libraries (whether for scientific computation or message-passing) plays 
a central role in our mixed-language programming methodology. 

In addition to the JCI-generated bindings, other basic components 
used in our high-performance Java programming methodology include 
performance-tuned implementations of scientific and communications li- 
braries available on different machines, and a native Java compiler such 
as IBM’s HPCJ. We also believe that our approach is practical in the sense 
that legacy code is ubiquitous and it would be much too tedious to port 
all of it to Java. If Java is to gain acceptance as a high-performance 
language it has to interface with such existing libraries. 

DISCUSSION 

Speaker: Vladimir Getov 

Scott Kohn : What are the memory costs and overheads for using Java 
for HPC? 

Vladimir Getov : This is an interesting but a rather general question. 
There are a number of issues related to the memory costs and overheads 
when using Java. First of all, each JVM has its own memory requirements 
that come in addition to the memory needed by the operating 
system. Subsequently, the remaining memory available for applications 
is smaller in comparison to the conventional case of static compilation, 
including the use of native Java compilers such as HPCJ. For Java appli- 
cation codes in particular running within a JVM, the available memory 
is defined by the allocated heap size. Tuning the heap size for bigger 
applications may turn out to be very important in order to utilize the 
available RAM efficiently. When using the JCI tool, one has to take into 
account also the JNI overhead and the linking of the specified native 
libraries at run-time. The wrapper software overhead is very 
small and can be neglected. In most cases the memory costs and 
overheads vary significantly between different vendors and versions of the 
Java platform. Therefore, quoting quantitative results should always be 
accompanied with information about the product, version, release, etc. 
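
As a minimal sketch of the last point, the following Java fragment reports heap figures together with the platform identification that should accompany them. The class name HeapReport and the output layout are illustrative choices; the calls used are the standard java.lang.Runtime methods and system properties. On many JVMs the maximum heap is set with a launcher option such as -Xmx, but the exact options vary between vendors and versions.

```java
// Reports JVM heap figures along with vendor and version information.
final class HeapReport {

    public static void main(String[] args) {
        System.out.println(describe());
    }

    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;
        long max = rt.maxMemory();          // upper bound the heap may grow to
        long committed = rt.totalMemory();  // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();
        return String.format(
            "%s %s (%s): max heap %d MiB, committed %d MiB, used %d MiB",
            System.getProperty("java.vm.name"),
            System.getProperty("java.version"),
            System.getProperty("java.vendor"),
            max / mib, committed / mib, used / mib);
    }
}
```

Memory held by native libraries behind JNI wrappers is allocated outside the Java heap and is therefore not covered by these figures.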

Morven Gentleman : Does JCI generate wrappers that can accommodate 
the needs of legacy libraries that require typeless containers, e.g., 
to support persistent data lifetimes across reverse communication calls? 

Vladimir Getov : The JCI tool generates the wrapper code on the 
basis of a mapping between the data types and structures of 
two given target languages. This mapping can be changed by the user 
depending on the specific characteristics of the two programming 
languages and the requirements of the application area. For example, we 
have used three different mappings so far, but none of them accommodates 
typeless containers. However, typeless containers can be included 
in a new mapping definition for automatic generation of wrappers to 
relevant legacy libraries. 
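
The idea of a user-changeable type mapping that drives wrapper generation can be illustrated with a small Java sketch. It does not reproduce the JCI tool or any of its three mappings: the class TypeMapping, its default table and the choice of representing a typeless C pointer as a Java long handle are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical C-to-Java type table used to emit native method declarations.
final class TypeMapping {

    public static void main(String[] args) {
        TypeMapping m = TypeMapping.defaults();
        System.out.println(m.nativeDeclaration("double", "ddot", "int", "double*", "double*"));
        // Extending the mapping: carry a typeless container as an opaque handle.
        m.put("void*", "long");
        System.out.println(m.nativeDeclaration("void", "step", "void*"));
    }

    private final Map<String, String> cToJava = new LinkedHashMap<>();

    /** A small illustrative default table; not one of the JCI mappings. */
    static TypeMapping defaults() {
        return new TypeMapping()
            .put("void", "void")
            .put("int", "int")
            .put("double", "double")
            .put("double*", "double[]")
            .put("char*", "String");
    }

    /** Adds or overrides one entry, so users can adapt the mapping. */
    TypeMapping put(String cType, String javaType) {
        cToJava.put(cType, javaType);
        return this;
    }

    String javaTypeFor(String cType) {
        String javaType = cToJava.get(cType);
        if (javaType == null) {
            throw new IllegalArgumentException("no mapping for C type: " + cType);
        }
        return javaType;
    }

    /** Builds the Java declaration for a C function with the given signature. */
    String nativeDeclaration(String cReturnType, String name, String... cParamTypes) {
        StringBuilder sb = new StringBuilder("public static native ");
        sb.append(javaTypeFor(cReturnType)).append(' ').append(name).append('(');
        for (int i = 0; i < cParamTypes.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(javaTypeFor(cParamTypes[i])).append(" arg").append(i);
        }
        return sb.append(");").toString();
    }
}
```

With the default table a function taking a typeless pointer is rejected; after the extra entry is added it can be wrapped, which mirrors the answer above that such containers would require a new mapping definition.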

THE ARCHITECTURE OF SCIENTIFIC 

In this chapter we provide some background information on the conference 
that brought together the researchers whose work is described in 
this volume. 

1. CONFERENCE VENUE 

The conference Software Architectures for Scientific Computing Ap- 
plications was held in Ottawa, Ontario, Canada on October 2-4, 2000. 
This was the eighth working conference of the International Federation for 
Information Processing (IFIP) Working Group on Numerical Software 
(WG2.5) on behalf of the IFIP Technical Committee on Software: The- 
ory and Practice (TC 2). The conference was held on the campus of the 
National Research Council (NRC) of Canada in Ottawa, where it was 
hosted by the Institute for Information Technology. The 38 attendees of 
the focused workshop heard 20 invited presentations. Considerable time 
was allotted for discussion of issues raised during the talks, and the bulk 
of this discussion is recorded in this volume. 

2. THE TECHNICAL PROGRAM 

Monday, October 2, 2000 

09:00 Network-based Scientific Computing. Elias Houstis (Purdue 
University) 

09:45 Broadway: Software Architecture for Scientific Computing. 
Calvin Lin (University of Texas at Austin) 

10:30 Break 

11:00 Data Management Systems for Scientific Applications. Rea- 
gan Moore (San Diego Supercomputer Center) 

11:45 Developing Architecture to Support the Implementation and 
Development of Scientific Computing Applications. Jack 
J. Dongarra (University of Tennessee) 

12:30 Lunch 




14:00 PETSc and Overture: Lessons Learned Developing an In- 
terface Between Components. Kristopher Buschelman (Ar- 
gonne National Laboratory) 

14:45 On the Role of Mathematical Abstractions for Scientific 
Computing. Krister Ahlander (University of Bergen) 

15:30 Break 

16:00 The Virtual Testing Facility Application Framework. 
J. C. T. Pool (California Institute of Technology) 

Tuesday, October 3, 2000 

09:00 Formal Methods for High-Performance Linear Algebra Libraries. 
Robert van de Geijn (University of Texas at Austin) 

09:45 Software Components for Application Development. Anne 
Trefethen (NAG) 

10:30 Break 

11:00 Component Technology for High-Performance Scientific 
Simulation Software. Scott Kohn (Lawrence Livermore Na- 
tional Laboratory) 

11:45 A New Approach to Software Integration Frameworks for 
Multi-Physics Simulation Codes. Milind Bhandarkar (Uni- 
versity of Illinois) 

12:30 Lunch 

14:00 Hierarchical Representation and Computation of Approxi- 
mate Solutions in Scientific Simulations. Wayne Enright 
(University of Toronto) 

14:45 Software Architecture for the Investigation of Controllable 
Models with Complex Data Sets. Dmitry V. Belyshev (Program 
Systems Institute, Russian Academy of Sciences) 

15:30 Break 

16:00 Multi-Language Programming Methodology for High-Performance 
Java. Vladimir Getov (University of Westminster 
and Los Alamos National Laboratory) 

Wednesday, October 4, 2000 

09:00 A Comprehensive DFT API for Scientific Computing. Ping 
Tak Peter Tang (Intel Corporation) 

09:45 New Generalized Data Structure for Matrices Leading to a 
Variety of High-Performance Algorithms. Fred Gustavson 
(IBM) 

10:30 Break 

11:00 A Collaborative Code Development Environment for Computational 
Electromagnetics. David W. Walker (Oak Ridge 
National Laboratory) 

11:45 Code Coupling Using Parallel CORBA Objects. Christophe 
Rene (IRISA/IFSIC) 

12:30 Lunch 

14:00 Object-oriented Modeling of Parallel PDE Solvers. Michael 
Thune (Uppsala University) 

14:45 A Fortran Interface for POSIX Threads. Richard Hanson 
(Rice University) 

3. ORGANIZATION 

The Program Committee consisted of Ronald Boisvert, Wayne En- 
right, Stuart Feldman, Brian Ford, Patrick Gaffney, Morven Gentleman 
(co-chair), Ian Gladwell, Eric Grosse, Pieter Hemker, Richard Kaaman, 
Mo Mu, James C. T. Pool (co-chair), John Rice, Mary Shaw, Brian 
Smith, Peter Tang, and Mladen Vouk. 

Morven Gentleman served as local arrangements chair. Roger Impey 
hosted the conference for the NRC, and Nicole Sarault and Guylaine 
Caron assisted. 

The conference was supported with grants from IBM Canada Ltd., 
the Hewlett-Packard Company, Intel Corporation, and the National Re- 
search Council of Canada. 

5. IFIP WORKING GROUP 2.5 

In 1974 IFIP established the Working Group on Numerical Software 
under the auspices of its Technical Committee on Software: Theory and 
Practice. The group started with 13 members and has since grown to a 
membership of more than 30. 

The aim of Working Group 2.5 is to improve the quality of numerical 
computation by promoting the development and availability of sound 
numerical software. Objectives within the scope of the Working Group 
2.5 are as follows. 



■ The definition from a numerical standpoint of a set of hardware 
and software features for a computing system. 

■ The development and improvement of programming languages for 
numerical computation. 

■ The establishment of guidelines for comparison of subroutines from 
different numerical program libraries. 

■ The establishment of guidelines for documentation, testing, distri- 
bution and maintenance of numerical program libraries. 

■ The exchange of information concerning numerical software and 
determination of the needs of computer users. 

The following mode of work has been established within WG 2.5. 

1 Working group meetings 

The group meets roughly once a year. (There have been 27 meetings 
so far.) At these meetings, areas of activity and the proper 
means of achieving results are defined. Reports on current or completed 
activities are given and discussed. 

2 Projects 

Most activities take the form of projects. One or several members 
of the group assume the responsibility to pursue a given subject 
matter in collaboration with other scientists in the field. Results 
of projects are either published through standard channels or, in 
special cases, may take the form of an IFIP-document. 

3 Working Conferences 

This is a standard means of IFIP-activity; 30-70 experts are invited 
to meet to discuss and advance a narrow technical subject area. 
The proceedings of a working conference appear as a book. 

For further information, see http://www.nsc.liu.se/wg25.html . 